***
**Author:** Josiah Wallis \
Created for use in CS/STAT108: Data Science Ethics (UCR - Winter 2024)
***

# Data Preprocessing and Visualization
This week, we'll discuss how to use built-in `numpy` and `pandas` functionality to do some data cleaning, preprocessing, and plotting. We'll see examples of how to use two fundamental plotting libraries: `matplotlib` and `seaborn`. This will give you the fundamentals to start your first integrative assignment!\
\
Before running the code, please download the **[Global Temperature Records](https://www.kaggle.com/datasets/maso0dahmed/global-temperature-records-1850-2022?resource=download)** dataset, upload it to your Google Colab environment, and rename it `data.csv`.

# Data Preprocessing
`pandas` has some great built-in tools for basic data preprocessing like removing or filling null values. Let's see some below!

In [None]:
# Import libraries
import numpy as np
import pandas as pd

In [None]:
# Load data
data = pd.read_csv('data.csv')
data

## Locating and filtering na/null values

### `isna()` and `any()`

In [None]:
data.isna()

In [None]:
data.isna().any()

In [None]:
# Grab data points with na values
points_na, column_idxs = np.where(data.isna())
print(f'Datapoints with null values: {points_na[:3]}\nNull columns: {list(data.columns[column_idxs[:3]])}')

In [None]:
# What percent of our data has na values?
points_na_uniq = np.unique(points_na)
num_null_points = len(points_na_uniq)
print(f'Number of points with null values: {num_null_points}\nPercent of data with null values: {num_null_points / data.shape[0]:.2%}')

In [None]:
# np.where returns positional row indices, so index by position with iloc
data.iloc[points_na_uniq[:3]]

In [None]:
# Find number of na values per feature
na_mask = data.isna().any()
columns_w_na = data.columns[na_mask]
for col in columns_w_na:
  num_na = data[col].isna().sum()
  print(f'{col} na count: {num_na}')
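
The loop above can also be written without explicit iteration. The sketch below is a minimal illustration on a made-up toy DataFrame (the `toy` names and values are hypothetical and are not part of the temperature dataset): `isna().sum()` counts nulls per column, and `isna().any(axis = 1)` flags rows that contain at least one null.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the temperature dataset
toy = pd.DataFrame({
  'temp': [10.5, np.nan, 12.0, np.nan],
  'country': ['A', 'B', None, 'D'],
  'year': [2000, 2001, 2002, 2003],
})

# Number of null values in each column
toy_na_per_column = toy.isna().sum()

# Boolean mask: True for rows with at least one null value
toy_rows_with_na = toy.isna().any(axis = 1)

# The mean of a boolean mask is the fraction of True entries
toy_frac_rows_with_na = toy_rows_with_na.mean()

print(toy_na_per_column)
print(f'Rows with nulls: {toy_rows_with_na.sum()} ({toy_frac_rows_with_na:.2%})')
```

Because the row mask is a boolean Series, it can also be used directly to filter, e.g. `toy[toy_rows_with_na]`.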

In [None]:
# dataset without samples with na values
data_dropna = data.dropna().copy()
data_dropna.shape

In [None]:
len(data) - len(data_dropna) == len(points_na_uniq)
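
Dropping every row with any null value can throw away a lot of data. As a hedged sketch on hypothetical toy data (the `toy_drop` names are made up for illustration), `dropna` also accepts a `subset` argument so that only the listed columns decide which rows are removed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the temperature dataset
toy_drop = pd.DataFrame({
  'temp': [10.5, np.nan, 12.0, 9.0],
  'note': [None, 'ok', None, 'ok'],
})

# Default behavior: drop any row containing at least one null value
toy_drop_any = toy_drop.dropna()

# Only look at the 'temp' column when deciding which rows to drop
toy_drop_temp = toy_drop.dropna(subset = ['temp'])

print(toy_drop_any.shape)
print(toy_drop_temp.shape)
```

Which option is appropriate depends on which columns your analysis actually needs.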

### `fillna()`

In [None]:
# New dataset with only numerical values
data_num = data.select_dtypes(include = ['number']).copy()
data_num

In [None]:
data_num.mean()

In [None]:
data_num = data_num.fillna(data_num.mean())
# points_na_uniq holds positional row indices, so index by position with iloc
data_num.iloc[points_na_uniq]
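
Filling with the column mean is one choice among several. The sketch below uses hypothetical toy data (the `toy_num` names and values are made up) to compare mean and median filling; the mean is pulled toward outliers, so the two can differ noticeably.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with an outlier (100.0) in column 'a'
toy_num = pd.DataFrame({
  'a': [1.0, np.nan, 3.0, 100.0],
  'b': [np.nan, 2.0, 4.0, 6.0],
})

# fillna with a Series fills each column using the value at the matching label
toy_mean_filled = toy_num.fillna(toy_num.mean())
toy_median_filled = toy_num.fillna(toy_num.median())

print(toy_mean_filled)
print(toy_median_filled)
```

In column `a`, the filled value is pulled up by the outlier when using the mean but not when using the median.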

`pandas` has many data transformation functions/methods like `DataFrame.map` and `DataFrame.replace`. If you think there's a way to clean/preprocess your data, there is probably a method or function for it. Check the documentation or search online! We'll see some useful functionality as needed.
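
As a minimal sketch of these transformation methods, the example below uses hypothetical toy data (the `toy_clean` names and values are made up). It uses `DataFrame.replace` to standardize labels and `Series.map` to apply a function to each element of one column; the elementwise `DataFrame.map` works the same way across a whole DataFrame but requires pandas 2.1 or newer.

```python
import pandas as pd

# Hypothetical toy data with inconsistent country labels
toy_clean = pd.DataFrame({
  'country': ['usa', 'U.S.A.', 'canada'],
  'temp_c': [20.0, 25.0, -5.0],
})

# replace: swap specific values in the 'country' column for cleaned versions
toy_clean = toy_clean.replace({'country': {'usa': 'USA', 'U.S.A.': 'USA', 'canada': 'Canada'}})

# map: apply a function to every element of a column (Celsius to Fahrenheit)
toy_clean['temp_f'] = toy_clean['temp_c'].map(lambda c: c * 9 / 5 + 32)

print(toy_clean)
```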

***
## Data Visualization


In [None]:
# Import libraries
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
data.head()

### `pandas` plotting

In [None]:
df = pd.DataFrame(data, columns = ['AverageTemperature', 'Country'])
df.head()

In [None]:
df['AverageTemperature'].hist()
plt.show()

In [None]:
df['Country'].value_counts().plot.pie(figsize = (18, 12))
plt.show()

### `plt` plotting

In [None]:
# Plot sin(x)
x = np.linspace(-4, 4, 100)
y = np.sin(x)

plt.plot(x, y, color = 'green', linestyle = '--')
plt.axhline(0, color = 'black')
plt.axvline(0, color = 'black')

plt.title('sin(x)')
plt.xlabel('x')
plt.ylabel('y')
plt.grid()
plt.show()

In [None]:
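# Subplots
# Note: x ** x is NaN for negative non-integer x (NumPy emits a RuntimeWarning),
# so cool_func is effectively drawn only where x > 0
cool_func = lambda x: np.abs(np.sin(x ** x) / (2 ** ((x ** x - np.pi/2)/np.pi)))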
func_list = [np.cos, lambda x: x ** 2, lambda x: np.abs(x), cool_func]
func_names = [r'$\cos{x}$', r'$x^2$', r'$|x|$', r'$|\frac{\sin{x^x}}{2^{(x^x-\pi / 2) / \pi}}|$']

plt.figure(figsize = (10, 10))
for i, f in enumerate(func_list):
  plt.subplot(2, 2, i + 1)
  plt.plot(x, f(x))
  plt.axhline(0, color = 'black')
  plt.axvline(0, color = 'black')
  plt.title(func_names[i])
  plt.grid()

plt.show()
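
The same kind of grid can be built with matplotlib's object-oriented interface. This is a sketch of an alternative style, not a replacement for the cell above; the `_demo` names and the choice of functions are made up for illustration. `plt.subplots` returns the figure and an array of axes, and each axis is then drawn on directly.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical demo inputs (named to avoid clashing with x and func_list above)
xs_demo = np.linspace(-4, 4, 100)
demo_funcs = [np.cos, np.square, np.abs, np.sin]
demo_names = [r'$\cos{x}$', r'$x^2$', r'$|x|$', r'$\sin{x}$']

# plt.subplots creates the figure and a 2x2 array of axes in one call
fig_demo, axes_demo = plt.subplots(2, 2, figsize = (10, 10))
for ax, f, name in zip(axes_demo.flat, demo_funcs, demo_names):
  ax.plot(xs_demo, f(xs_demo))
  ax.axhline(0, color = 'black')
  ax.axvline(0, color = 'black')
  ax.set_title(name)
  ax.grid()

fig_demo.tight_layout()
plt.show()
```

Working with explicit axes objects tends to be easier to manage once a figure has several panels, since each call names the panel it affects.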

### Combining `plt` with `pandas` and `seaborn`

In [None]:
# Using plt with pandas plotting
df['AverageTemperature'].hist(bins = 50)
plt.title('Avg Temp Histogram')
plt.xlabel('Temperature')
plt.ylabel('Count')
plt.show()

In [None]:
# Seaborn and plt
sns.displot(df['AverageTemperature'], kde = True, bins = 50, color = 'purple')
plt.title('Density Estimation')
plt.xlabel('Temperature')
plt.ylabel('Counts')
plt.grid()
plt.show()

In [None]:
# Only density plot
sns.displot(df['AverageTemperature'], kind = 'kde')
plt.title('Density Estimation')
plt.xlabel('Temperature')
plt.ylabel(r'f(x)')
plt.grid()
plt.show()

In [None]:
# Grab subset of data
countries = np.unique(df['Country'])
# Keep only rows whose country is one of the first three (in sorted order)
df_3countries = df[df['Country'].isin(countries[:3])]
np.unique(df_3countries['Country'])

In [None]:
# Separate density curves on one graph
sns.displot(df_3countries, x = 'AverageTemperature', hue = 'Country', kind = 'kde')
plt.title('Density Estimation')
plt.xlabel('Temperature')
plt.ylabel('Density')
plt.grid()
plt.show()

In [None]:
# Barplots with sns
colors = ['red', 'green', 'blue']
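# Pass x and y explicitly so each country gets its own bar regardless of seaborn version
country_counts = df_3countries['Country'].value_counts()
sns.barplot(x = country_counts.index, y = country_counts.values, palette = colors)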
plt.title('3-Country Barplot')
plt.xlabel('Countries')
plt.ylabel('Count')
plt.show()