# Chapter 1 & Chapter 2

### Data Preprocessing

- Comes after data cleaning and Exploratory Data Analysis (EDA)
- Prerequisite for modeling
- Helps to:
    - Produce more reliable results
    - Improve model performance
- Inspect the dataset
- See summary statistics
- Deal with missing values
- Convert columns to the required types
- Split into training and testing sets (take class imbalance into account)
    - Data leakage: information from outside the training set (e.g. test data) is used to train the model, which makes evaluation scores overly optimistic
- Standardize data: transform numeric features so they are on a comparable scale (e.g. mean 0 and variance 1) and closer to normally distributed
    - A feature with much higher variance than the others can dominate the model and bias it toward that feature
    - Large differences in scale among features can hurt the fit of models that assume similarly scaled inputs
    - Methods: log normalization, standard scaling
    - Tree-based models can be trained without standardization
    - Other models, such as linear models, and high-dimensional datasets generally require standardization
- Feature Engineering:
    - e.g. vectorizing text
    - e.g. resampling time data (changing time granularity: from seconds to weeks, months, etc.)
    - e.g. one-hot encoding

```
# Inspect dataset
df.head()
df.info()
df.describe() # Summary stats

# DEAL WITH MISSING VALUES
df.drop([1, 2, 3]) # Drop specific rows
df.dropna(thresh=2) # Keep only rows with at least 2 non-missing values
df.dropna(subset=['C']) # Drop rows where column 'C' is missing

# Convert column types
df["C"] = df["C"].astype("float")

# Verify class imbalance
y.value_counts()

# Split into training and testing data (Consider class imbalance)
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# STANDARDIZE DATASET
df.var() # Columns with much higher variance than the rest are candidates for log normalization

# FEATURE ENGINEERING
# TEXT PROCESSING (cleaning / vectorizing / regular expression )
```
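
To make the data-leakage bullet concrete, here is a minimal sketch, assuming a small synthetic dataset (the feature names `f1`, `f2` and all values are made up for illustration). The split happens first, the scaler is fit on the training split only, and the same fitted parameters are then applied to the test split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two numeric features and an imbalanced binary target
rng = np.random.default_rng(0)
X = pd.DataFrame({"f1": rng.normal(50, 10, 100), "f2": rng.normal(0, 1, 100)})
y = pd.Series([0] * 80 + [1] * 20)

# Split first, stratifying on y so both splits keep the 80/20 class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)

# Fit on the training split only, then reuse the fitted scaler on the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Calling `fit` on the full dataset before splitting would let test-set statistics influence the training features, which is one common form of leakage.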

### Deal with Missing values

```
# Check missing data
df.isna().any()
df.isna().sum()
# Visualize missing data information
import missingno as msno
import matplotlib.pyplot as plt
msno.matrix(df)
plt.show()

# Drop missing data
df_dropped = df.dropna(subset=['col']) # Drop rows where 'col' is missing
df.dropna(axis=0) # Drop every row that contains a missing value (default)
df.dropna(axis=1) # Drop every column that contains a missing value

# Replace/impute missing data with single value
import numpy as np
col_mean = df['col'].mean()
df_imputed = df.fillna({'col': col_mean})
df['col'] = df['col'].replace(np.nan, col_mean) # Alternative
# Replace/impute missing data with series
series_imp = df['col1'] * 5
df_imputed = df.fillna({'col2':series_imp})

df["col"].value_counts() # Look out for suspicious values

##### Strategic dropping example ########
# Drop rows with missing values in columns where <= 5% of the data is missing; otherwise impute
threshold = len(df) * 0.05
cols_to_drop = df.columns[df.isna().sum() <= threshold]
df.dropna(subset=cols_to_drop, inplace=True)
cols_with_missing_values = df.columns[df.isna().sum() > 0]
# Impute with the mode; [:-1] skips the last column (assumed to be "num_col", handled below)
for col in cols_with_missing_values[:-1]:
    df[col] = df[col].fillna(df[col].mode()[0])
# Impute "num_col" with the median of its "cat_col" subgroup
subgroup_dict = df.groupby("cat_col")["num_col"].median().to_dict()
df["num_col"] = df["num_col"].fillna(df["cat_col"].map(subgroup_dict))
```
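
The subgroup-median imputation at the end of the block above can be checked on a tiny example. This is only a sketch on made-up data; the column names `dept` and `salary` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: salary is missing for one row in each department
df = pd.DataFrame({
    "dept": ["A", "A", "A", "B", "B", "B"],
    "salary": [10.0, 20.0, np.nan, 100.0, np.nan, 300.0],
})

# Median salary per department, computed from the non-missing values
dept_median = df.groupby("dept")["salary"].median().to_dict()

# Fill each missing salary with the median of its own department
df["salary"] = df["salary"].fillna(df["dept"].map(dept_median))
```

After this runs, the two missing salaries are filled with 15.0 (department A) and 200.0 (department B), rather than with a single global value.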

### Standardize Dataset

```
# Feature Scaling
df["feature_scaled"] = df["col"]/ (df["col"].max())
# Min-max Scaling
df["minmax_scaled"] = (df["col"] - df["col"].min()) / (df["col"].max() - df["col"].min())
# Z-score
df["z_scaled"] = (df["col"] - df["col"].mean()) / df["col"].std() 

# Alternative: using scikit-learn
from sklearn.preprocessing import MinMaxScaler, StandardScaler, PowerTransformer
minmax_scaler = MinMaxScaler()
standard_scaler = StandardScaler()
log_scaler = PowerTransformer() # Yeo-Johnson power transform by default (log-like, not a plain log)
scaler = standard_scaler # Pick one of the scalers above
scaler.fit(df[['col']])
df['scaled_col'] = scaler.transform(df[['col']]).ravel()
```
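
Log normalization is listed in the overview but not shown above. A minimal sketch, assuming made-up `age` and `income` columns, and using `np.log1p` as one possible choice of log transform:

```python
import numpy as np
import pandas as pd

# Hypothetical data: "income" spans several orders of magnitude, "age" does not
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 62],
    "income": [1_000, 10_000, 100_000, 1_000_000, 10_000_000],
})

# Compare variances to spot columns on a very different scale
variances = df.var()

# log1p computes log(1 + x), so it stays defined when a value is 0
df["income_log"] = np.log1p(df["income"])
```

The transformed column has a far smaller variance than the raw one, which brings it closer to the scale of the other features. A plain log is only defined for positive values, so columns with negative entries need a different transform.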

# Chapter 3

### Feature Engineering

```
# Visualize distribution
import seaborn as sns
import matplotlib.pyplot as plt
sns.pairplot(df)
plt.show()
df[['column_1']].boxplot()
plt.show()

# One-hot encoding
pd.get_dummies(df, columns=['cat'], prefix='C')
# Dummy encoding
pd.get_dummies(df, columns=['cat'], drop_first=True, prefix='C')

# Merging low frequency categorical counts
counts = df['cat'].value_counts()
mask = df['cat'].isin(counts[counts < 5].index) 
df.loc[mask, 'cat'] = 'Other' # Use .loc to avoid chained assignment

# Binarizing numeric variables
df['Binary_col'] = 0
df.loc[df['Number_col'] > 0, 'Binary_col'] = 1
# Binning numeric variables into labeled intervals
import numpy as np
df['Binned_Group'] = pd.cut(df['Number_col'], bins=[-np.inf, 0, 2, np.inf], labels=[1, 2, 3])

# SCALE / STANDARDIZE DATA
# DEAL WITH MISSING VALUES.....
# DEAL WITH OUTLIERS

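# Validate numeric columns
df['RawSalary'] = df['RawSalary'].str.replace(',', '', regex=False) # Strip thousands separators
coerced_vals = pd.to_numeric(df['RawSalary'], errors='coerce') # Unparseable values become NaN
print(df[coerced_vals.isna()].head()) # Sanity check which values still show errors
df['RawSalary'] = coerced_vals # Keep the numeric version once the remaining errors are handled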
```
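
The overview mentions resampling time data as a feature-engineering step. A minimal sketch with pandas `resample`, assuming a made-up daily series (the column name `value` and the dates are illustrative only):

```python
import pandas as pd

# Hypothetical data: one reading per day for 14 days, starting on a Monday
idx = pd.date_range("2024-01-01", periods=14, freq="D")
daily = pd.DataFrame({"value": range(14)}, index=idx)

# Aggregate daily rows into weekly rows (weeks ending on Sunday)
weekly = daily.resample("W").sum()
```

Here `weekly` holds two rows with sums 21 and 70. Swapping `.sum()` for `.mean()` or another aggregation changes how the finer-grained rows are combined.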

### One-hot encoding

```
# Binary Encoding
df["cat_col"] = df["cat_col"].apply(lambda val: 1 if val == "y" else 0)

# One-hot-encoding on categorical variable
df_onehot = pd.get_dummies(df, columns=['cat'], prefix='C')
df_dummy = pd.get_dummies(df, columns=['cat'], drop_first=True, prefix='C')

# Alternative approach: scikit-learn's OneHotEncoder
from sklearn import preprocessing
encoder = preprocessing.OneHotEncoder()
onehot_transformed = encoder.fit_transform(df[['cat_col']])
# Convert the sparse result into a dataframe, keeping the original index and readable column names
onehot_df = pd.DataFrame(onehot_transformed.toarray(), columns=encoder.get_feature_names_out(), index=df.index)
# Add the encoded columns to the original dataset
df = pd.concat([df, onehot_df], axis=1)
# Drop the original column that was used for encoding
df = df.drop('cat_col', axis=1)

# Label encoding : Turning string labels into numeric values
from sklearn import preprocessing
encoder_lvl = preprocessing.LabelEncoder()
# Specify the unique categories in the column (codes follow sorted order: HIGH=0, LOW=1, NORMAL=2)
encoder_lvl.fit(['LOW', 'NORMAL', 'HIGH'])
# Apply label encoding to the third column of the dataset
third_col = df.columns[2]
df[third_col] = encoder_lvl.transform(df[third_col])
```
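
`LabelEncoder` assigns codes in sorted order, so it does not preserve a ranking such as LOW < NORMAL < HIGH. When the order matters, one option (not part of the notes above, shown here as an alternative) is scikit-learn's `OrdinalEncoder` with explicit categories. A minimal sketch on made-up data, with a hypothetical `level` column:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data with an ordered categorical column
df = pd.DataFrame({"level": ["LOW", "HIGH", "NORMAL", "LOW"]})

# Pass the categories in their intended order: LOW < NORMAL < HIGH
encoder = OrdinalEncoder(categories=[["LOW", "NORMAL", "HIGH"]])
df["level_encoded"] = encoder.fit_transform(df[["level"]]).ravel()
```

With this ordering, LOW maps to 0, NORMAL to 1, and HIGH to 2 (returned as floats).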

### Text Processing

```
# Remove non-letter characters
speech_df['text'] = speech_df['text'].str.replace('[^a-zA-Z]', ' ', regex=True)
# Standardize case
speech_df['text'] = speech_df['text'].str.lower()
# Generate Feature : Average length of word
speech_df['char_cnt'] = speech_df['text'].str.len()
speech_df['word_cnt'] = speech_df['text'].str.split().apply(len)
speech_df['avg_word_len'] = speech_df['char_cnt'] / speech_df['word_cnt']

# Generate Feature : tf-idf
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords # Import before use
# nltk.download('stopwords')
stop_words = stopwords.words('english') # Keep as a list; recent scikit-learn versions reject a set
vec = TfidfVectorizer(max_df=0.9, min_df=0.1, max_features=100, stop_words=stop_words)

# Generate Feature : Bag of words / Word Count Vector
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(max_features=100, stop_words='english', min_df=0.1, max_df=0.9)

# Generate Feature : Introduce context with n-grams
vec = TfidfVectorizer(max_df=0.9, min_df=0.1, max_features=100, stop_words=stop_words, ngram_range = (2,2)) # Find context in 2 consecutive words
vec.fit(speech_df['text'])
transformed = vec.transform(speech_df['text'])
vec_df = pd.DataFrame(transformed.toarray(), columns=vec.get_feature_names_out(), index=speech_df.index).add_prefix('TFIDF_')

# Sanity check : Find common words / patterns
vec_df.iloc[0].sort_values(ascending=False).head()
vec_df.sum().sort_values(ascending=False).head()

speech_df = pd.concat([speech_df, vec_df], axis=1, sort=False)
```