# Chapter 1

### Feature Engineering

- Create new features (eg: averages, BMI etc)
- Visualize distribution with boxplot, pairplot of dataset to see if Transformation is necessary (eg: log transformation)
- Normalize/Standardize/Scale features
- Encoding : Convert categories into numeric data
    - One-hot encoding : Explainable features, create N columns for N categories
    - Dummy encoding : Necessary information without duplication, create N-1 columns for N categories
- Merge low frequent categorical values (uncommon categories) into one single category (eg: `other`)
- Binarise numeric values (eg: from `num_violations` to `violation_boolean`)
- Deal with missing values:
    - drop columns (or rows) whose share of missing values exceeds a threshold (eg: >30% missing)
    - fill values that are missing completely at random (with mean, median, mode, `Other`, or the next present value in sorted order)
- Deal with outliers
- Validate numeric columns
    - remove characters from numeric data (eg: `$` or `,` sign for currency)
    - make sure the column is in proper datatype (eg: `float`, `int` etc)


```
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Visualize distribution
sns.pairplot(df)
df[['column_1']].boxplot()
plt.show()

# One-hot encoding
pd.get_dummies(df, columns=['cat'], prefix='C')
# Dummy encoding
pd.get_dummies(df, columns=['cat'], drop_first=True, prefix='C')

# Merging low frequency categorical counts
counts = df['cat'].value_counts()
mask = df['cat'].isin(counts[counts < 5].index)
df.loc[mask, 'cat'] = 'Other'  # .loc avoids chained assignment

# Binarizing numeric variables
df['Binary_col'] = 0
df.loc[df['Number_col'] > 0, 'Binary_col'] = 1
# Binning numeric variables into labelled groups: (-inf, 0], (0, 2], (2, inf)
df['Binned_Group'] = pd.cut(df['Number_col'], bins=[-np.inf, 0, 2, np.inf], labels=[1, 2, 3])

# SCALE / STANDARDIZE DATA
# DEAL WITH MISSING VALUES.....
# DEAL WITH OUTLIERS

# Validate numeric columns
# Strip thousands separators, then coerce: unparseable values become NaN instead of raising
cleaned = df['RawSalary'].str.replace(',', '', regex=False)
coerced_vals = pd.to_numeric(cleaned, errors='coerce')
print(df[coerced_vals.isna()].head()) # Sanity check which values still show errors
df['RawSalary'] = coerced_vals
```
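
The list above mentions creating new features and log transformation, but the snippet does not show them. A minimal sketch, assuming a small made-up DataFrame (all column names and values here are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data: column names and values are made up for illustration
df = pd.DataFrame({
    "weight_kg": [60.0, 80.0, 100.0],
    "height_m": [1.5, 2.0, 2.0],
    "score_1": [10.0, 20.0, 30.0],
    "score_2": [20.0, 40.0, 60.0],
    "income": [0.0, 999.0, 99999.0],
})

# Derived feature: BMI = weight / height^2
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Derived feature: row-wise average of related columns
df["score_mean"] = df[["score_1", "score_2"]].mean(axis=1)

# Log transformation for a right-skewed column; log1p(x) = log(1 + x) handles zeros
df["income_log"] = np.log1p(df["income"])

print(df[["bmi", "score_mean", "income_log"]])
```

`np.log1p` is used instead of `np.log` here only because the made-up column contains a zero.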

### Binning

```
import numpy as np
import pandas as pd

# 4 equally spaced edges give 3 equal-width bins, one per label
bins = np.linspace(df["price"].min(), df["price"].max(), 4)
custom_labels = ["low","medium","high"]
df["price_bin"] = pd.cut(df["price"], bins, labels=custom_labels, include_lowest=True)

# Alternative approach: quantile-based bins holding roughly equal numbers of rows
df['price_bin'] = pd.qcut(df['price'], q=3)
```
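
`pd.cut` and `pd.qcut` can give very different groups on skewed data. A small sketch on made-up prices (the values are hypothetical) to show the difference:

```python
import pandas as pd

# Hypothetical prices with one large value to show the difference
prices = pd.Series([10, 12, 14, 16, 18, 100], name="price")
labels = ["low", "medium", "high"]

# Equal-width bins: the value range is cut into 3 intervals of equal length
equal_width = pd.cut(prices, bins=3, labels=labels)

# Equal-frequency bins: each bin receives about the same number of rows
equal_freq = pd.qcut(prices, q=3, labels=labels)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```

With equal-width bins, five of the six prices land in `low` and `medium` stays empty. With quantile bins, each group holds two prices.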

### Dataframe column

```
# See column names
df.columns
# Set column names
df.columns = ['A', 'B', 'C']
# Data type of columns
df.dtypes
# Select column of specific types only
df_ints = df.select_dtypes(include=['int'])
# Set type of a column
df['num_col']=df['num_col'].astype(int)
# See column description
df.describe()
# See column information
df.info()
# See frequencies in categorical column
df['cat'].value_counts()
```

### One hot encoding

```
# One-hot-encoding on categorical variable
df_onehot = pd.get_dummies(df, columns=['cat'], prefix='C')
df_dummy = pd.get_dummies(df, columns=['cat'], drop_first=True, prefix='C')

# Alternative approach-2
from sklearn import preprocessing
encoder = preprocessing.OneHotEncoder()
onehot_transformed = encoder.fit_transform(df[['cat_col']])
# Convert into dataframe, keeping the original index so the concat below aligns rows
onehot_df = pd.DataFrame(onehot_transformed.toarray(), index=df.index, columns=encoder.get_feature_names_out())
# Add the encoded columns with original dataset, 
df = pd.concat([df, onehot_df], axis=1)
# Drop the original column that you used for encoding 
df = df.drop('cat_col', axis=1)

# Label encoding : Turning string labels into numeric values
from sklearn import preprocessing
encoder_lvl = preprocessing.LabelEncoder()
# Fit on the known categories; classes are stored in sorted order (HIGH=0, LOW=1, NORMAL=2)
encoder_lvl.fit(['LOW', 'NORMAL', 'HIGH'])
# Apply label encoding to the third column of the dataset
third_col = df.columns[2]
df[third_col] = encoder_lvl.transform(df[third_col])
```
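
`LabelEncoder` assigns codes in sorted order, so `LOW < NORMAL < HIGH` is not preserved. One possible alternative (not part of the snippet above) is an ordered pandas categorical. A minimal sketch, assuming a hypothetical `level` column:

```python
import pandas as pd

# Hypothetical column of ordered levels
df = pd.DataFrame({"level": ["LOW", "HIGH", "NORMAL", "LOW"]})

# Declare the order explicitly, from lowest to highest
ordered_levels = pd.CategoricalDtype(categories=["LOW", "NORMAL", "HIGH"], ordered=True)

# .cat.codes returns each value's position in the declared order
df["level_code"] = df["level"].astype(ordered_levels).cat.codes

print(df)
```

Values that are not in the declared categories become missing and get the code `-1`.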

# Chapter 2

### Deal with Missing values

```
# Show number of missing data
df.isna().sum()

# Visualize missing data information
import missingno as msno
import matplotlib.pyplot as plt
msno.matrix(df)
plt.show()

# Drop missing data
df_dropped = df.dropna(subset = ['col'])

# Replace/impute missing data with single value
col_mean = df['col'].mean()
df_imputed = df.fillna({'col': col_mean})

# Replace/impute missing data with series
series_imp = df['col1'] * 5
df_imputed = df.fillna({'col2':series_imp})

# Missing values are not always "NaN". They can be blank, "?" or other symbols (rarely)
# Check for values through manual validations first
df["col"].value_counts() # Look out for suspicious values
# Check which columns contain missing values
df.isna().any()
# Determine number of missing values in a column
df['col'].isnull().sum()
# Drop missing values
df.dropna(axis = 0) # Drop entire row for missing value (default)
df.dropna(axis = 1) # Drop entire column for missing value
# Drop missing values for specific column
df.dropna(subset = ["col"], axis = 0)
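# Replace missing values (like dropna, these return a new object; assign the result to keep it)
df["col"].replace(np.nan, new_val) # new_val is the fill value you choose
df.fillna(0)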
```
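
The steps above (spot non-standard markers, drop by threshold, then impute) can be chained. A minimal sketch on made-up data, assuming `?` is the missing marker and 30% is the threshold (both are illustrative choices):

```python
import numpy as np
import pandas as pd

# Hypothetical data: "?" marks a missing value in column "b"
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "b": ["x", "?", "y", "x", "x"],
    "c": [np.nan, np.nan, np.nan, 1.0, 2.0],
})

# Turn the non-standard marker into a real NaN so isna() can see it
df["b"] = df["b"].replace("?", np.nan)

# Share of missing values per column
missing_share = df.isna().mean()

# Keep only columns with at most 30% missing values
threshold = 0.30
df_kept = df.loc[:, missing_share <= threshold]

# Fill the remaining numeric gaps with the column median
df_kept = df_kept.fillna({"a": df_kept["a"].median()})

print(missing_share)
print(df_kept)
```

Column `c` is 60% missing and gets dropped. Column `a` is filled with its median, and the gap in `b` is left for a categorical strategy such as `Other`.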

# Chapter 3

### Distribution

- uniform distribution:
    - all outcomes are equally likely
    - flat probability density function across the entire range.
- binomial distribution:
    - discrete probability distribution
    - binary outcomes (success or failure)
    - independent trials
- normal distribution:
    - symmetrical
    - probability never hits 0
    - described by mean and std
    - standard normal distribution has mean 0 and std 1
    - 68% of the area lies within 1 SD of the mean, 95% within 2 SD, 99.7% within 3 SD
    - Central Limit Theorem : the sampling distribution of a statistic (eg: the sample mean) gets closer to the normal distribution as the sample size (number of trials) increases, provided samples are drawn randomly and independently.
- Poisson distribution:
    - events occur at a constant average rate over a fixed interval of time
    - expected value (lambda) represents the average number of events per time interval
    - the timing of each event is random and independent of the others
    - Discrete distribution (since it counts events)
    - eg: on average 5 adoptions each week from a pet shelter. However, the time at which each adoption happens is random.
- Exponential distribution:
    - probability of the waiting time between Poisson events
    - uses the same lambda (average rate) as the Poisson distribution
    - Continuous distribution (since it represents time)
    - scale = 1/lambda, the average time per event
    - example: one person requests a ticket every 2 minutes on average.
        - So 1 minute sees 0.5 requests, giving a Poisson rate of lambda = 0.5 per minute
        - And the exponential scale = 1/lambda = 1/0.5 = 2 minutes per request
- t-distribution
    - tails are thicker than normal distribution
    - Observations are more likely to fall further from the mean
    - has a degrees-of-freedom parameter that controls the thickness of the tails
        - lower degrees of freedom = thicker tails + higher std
        - higher degrees of freedom = thinner tails + lower std = more like normal distribution
- Log normal distribution
    - logarithm of variable is normally distributed
    - parameterized by the mean and standard deviation of the logarithm of the variable
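
Two of the claims above can be checked numerically: the 68/95/99.7 rule and the ticket example linking the Poisson rate to the exponential scale. A minimal sketch using only the Python standard library (`scipy.stats` has `norm`, `poisson` and `expon` for the same purpose); the 2-minute window is an illustrative choice:

```python
import math
from statistics import NormalDist

# Empirical rule: area of the standard normal within k standard deviations
std_normal = NormalDist(mu=0.0, sigma=1.0)
areas = {}
for k in (1, 2, 3):
    areas[k] = std_normal.cdf(k) - std_normal.cdf(-k)
    print(f"within {k} SD: {areas[k]:.4f}")

# Ticket example: one request every 2 minutes on average
lam = 0.5          # Poisson rate: requests per minute
scale = 1 / lam    # exponential scale: minutes per request

# Exponential: probability that the wait for the next request is at most 2 minutes
p_wait_le_2 = 1 - math.exp(-lam * 2)

# Poisson: probability of seeing zero requests in a 2-minute window (mean = lam * 2)
p_zero_in_2 = math.exp(-lam * 2)

print(f"scale = {scale} min, P(wait <= 2) = {p_wait_le_2:.4f}, P(0 requests in 2 min) = {p_zero_in_2:.4f}")
```

The two probabilities in the last line add up to 1: waiting longer than 2 minutes is the same event as seeing no request in those 2 minutes.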

### Remove outliers

```
# Method 1 : IQR fences (1.5 * IQR beyond the quartiles)
def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3-q1 #Interquartile range
    fence_low  = q1-1.5*iqr
    fence_high = q3+1.5*iqr
    df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
    return df_out

# Method 2 : 3 standard deviations from the mean
def remove_outliers(df_in, col_name):
    mean = df_in[col_name].mean()
    std = df_in[col_name].std()
    cut_off = std * 3
    lower, upper = mean - cut_off, mean + cut_off
    df_out = df_in[(df_in[col_name] < upper) & (df_in[col_name] > lower)]
    return df_out

# Method 3 : Not recommended (drops everything at or above the chosen quantile, outlier or not)
def trim_outliers(df_in, col_name, quantile_value=0.95):
    quantile = df_in[col_name].quantile(quantile_value)
    df_out = df_in[df_in[col_name] < quantile]
    return df_out

```
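
Removing rows is not the only option. Capping values at the fences keeps every row. The sketch below reuses the IQR rule from Method 1 on made-up data and compares both; capping with `Series.clip` is an alternative added here, not one of the methods above:

```python
import pandas as pd

# Hypothetical data with one extreme value
df = pd.DataFrame({"col": [10, 12, 11, 13, 12, 11, 95]})

q1 = df["col"].quantile(0.25)
q3 = df["col"].quantile(0.75)
iqr = q3 - q1  # Interquartile range
fence_low = q1 - 1.5 * iqr
fence_high = q3 + 1.5 * iqr

# Option A: drop rows outside the fences (same rule as Method 1)
df_removed = df[(df["col"] > fence_low) & (df["col"] < fence_high)]

# Option B: keep every row but cap values at the fences
df_capped = df.assign(col=df["col"].clip(lower=fence_low, upper=fence_high))

print(len(df_removed), len(df_capped))
print(df_capped["col"].max())
```

Here the fences are 8.75 and 14.75, so Option A drops the row with 95 and Option B replaces 95 with 14.75.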

### Scaling data

```
# Simple feature scaling (divide by the maximum)
df["feature_scaled"] = df["col"]/ (df["col"].max())
# Min-max Scaling
df["minmax_scaled"] = (df["col"] - df["col"].min()) / (df["col"].max() - df["col"].min())
# Z-score
df["z_scaled"] = (df["col"] - df["col"].mean()) / df["col"].std() 

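# Alternative : Using scikit learn
from sklearn.preprocessing import MinMaxScaler, StandardScaler, PowerTransformer
minmax_scaler = MinMaxScaler()      # rescales to the [0, 1] range
standard_scaler = StandardScaler()  # zero mean, unit variance
log_scaler = PowerTransformer()     # Yeo-Johnson power transform by default (log-like, not a plain log)
your_scaler = standard_scaler       # pick whichever scaler you need
your_scaler.fit(df[['col']])
df['scaled_col'] = your_scaler.transform(df[['col']]).ravel()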
```
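
The manual z-score and `StandardScaler` do not give identical numbers: pandas `.std()` defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` divides by the population standard deviation (`ddof=0`). A minimal pandas-only sketch on made-up values:

```python
import pandas as pd

# Hypothetical values with mean 5 and population standard deviation 2
df = pd.DataFrame({"col": [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]})

centered = df["col"] - df["col"].mean()

# pandas default: sample standard deviation (ddof=1)
z_sample = centered / df["col"].std()

# Population standard deviation (ddof=0), the convention StandardScaler uses
z_population = centered / df["col"].std(ddof=0)

print(pd.DataFrame({"z_sample": z_sample, "z_population": z_population}))
```

The gap shrinks as the number of rows grows, so it mostly matters for small datasets or when comparing results across tools.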