In this notebook we focus on data preprocessing techniques in Python. Preprocessing refers to the transformations applied to data before it is fed to a learning algorithm; it converts raw data into a clean dataset. It is the first (and arguably most important) step toward building a working machine learning model: if your data has not been cleaned and preprocessed, your model may not work. Data gathered from different sources usually arrives in a raw format that is not suitable for analysis.

As already discussed, learning algorithms have an affinity for certain data types, on which they perform very well. They are also known to give unreliable predictions with unscaled or unstandardized features. Machine learning tools are only as good as the quality of their data, and sophisticated algorithms will not make up for poor data. Data needs to go through a few steps before it is ready for further use.

In Python, the scikit-learn library provides pre-built preprocessing functionality under sklearn.preprocessing.

#### Preprocessing Data

The *sklearn* **preprocessing library** forms a solid foundation to guide you through this important task in the data science pipeline. This notebook covers the utility functions and transformer classes available in sklearn, supplemented with some useful functions from other common libraries.
The notebook is structured in a **logical order** representing the order in which one should execute the transformations discussed. The following issues will be handled:

- Missing values

- Outlier detection

- Feature scaling

- Normalization

- Categorical features

- Numerical features

- Custom transformations

- Polynomial features (not discussed)
 


In [None]:
import numpy as np

import scipy as sp

import matplotlib.pyplot as plt

import pandas as pd

import seaborn as sns

#### Import the Dataset

Many datasets come in CSV format, which can be read using the pandas function read_csv.

In [None]:
df = pd.read_csv('/Users/home/Desktop/Univ/CSTU/PYTHON_NOTEBOOKS/titanic_train.csv')

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.dtypes

In [None]:
df.describe()

After inspecting our dataset carefully, we create a matrix of features (X), the independent variables, and a vector (y) holding the dependent variable for the respective observations.

In [None]:
X = df.iloc[:,2:].values
y = df.iloc[:,1].values


### Handling missing data in Dataset

Some values in a dataset are often missing, and you need to handle them when you come across them. Handling missing values is an essential preprocessing task; done without sufficient care, it can drastically degrade the model. A few questions should come up when handling missing values:

*Do I have missing values? How are they expressed in the data? Should I withhold samples with missing values? Or should I replace them? If so, which values should they be replaced with?*

Before handling missing values, it is important to **identify the missing values** and to know which placeholder value represents them in the data.
 

The library that we are going to use for this task is **scikit-learn**. Its sklearn.impute module contains a class called **SimpleImputer** (the replacement for the older Imputer class from sklearn.preprocessing), which will help us take care of the missing data.
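Missing values are not always stored as NaN. The following is a minimal sketch, on hypothetical toy data, of how placeholder values can hide missing entries from pandas; the column names and the placeholders "?" and -999 are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "?" and -999 are assumed placeholders for "missing"
toy = pd.DataFrame({
    "age": [22, -999, 35, 41],
    "cabin": ["C85", "?", "E46", "?"],
})

# Without conversion, pandas reports no missing values at all
missing_before = int(toy.isnull().sum().sum())

# Convert the placeholders to NaN so isnull(), dropna() and imputers can see them
toy_clean = toy.replace({"age": {-999: np.nan}, "cabin": {"?": np.nan}})
missing_after = toy_clean.isnull().sum()

print("missing before conversion:", missing_before)
print(missing_after)
```

Once the placeholders are converted, the counting and imputation steps used in this notebook apply unchanged.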


#### How many missing data points are there?

In [None]:
# get the number of missing data points in each column
missing_values_df = df.isnull().sum()
print("missing values in each column:", '\n')
print(missing_values_df)

print("percentage of missing values in each column:", '\n')
print(missing_values_df / len(df) * 100)


It might be helpful to see what percentage of the values in our dataset are missing.


In [None]:
total_points = np.prod(df.shape)  # total number of cells (rows * columns); np.product is deprecated
total_missing_values = missing_values_df.sum()

# percent of data that is missing
(total_missing_values/total_points) * 100

In [None]:
# Rows with missing values
null_data = df[df.isnull().any(axis=1)]
print("number of rows containing missing values:", null_data.shape[0])

null_data.head()

Figure out the reason for each missing value and take appropriate action, as discussed.


#### Dropping missing values
If you're sure you want to drop rows with missing values, pandas has a handy function, dropna(), to help you do this.
*Syntax* 

DataFrame.dropna(self, axis=0, how='any', thresh=None, subset=None, inplace=False)
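As a quick illustration of the parameters in the signature above, here is a small sketch on a made-up data frame (the frame and its column names are hypothetical, not part of the Titanic data).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 6.0, 7.0],
    "c": [8.0, 9.0, 10.0, 11.0],
})

rows_any = toy.dropna()                     # drop rows with at least one NaN
rows_all = toy.dropna(how="all")            # drop rows only if every value is NaN
rows_subset = toy.dropna(subset=["a"])      # only consider NaNs in column "a"
cols_thresh = toy.dropna(axis=1, thresh=3)  # keep columns with at least 3 non-NaN values

print(rows_any.shape, rows_all.shape, rows_subset.shape)
print(list(cols_thresh.columns))
```

Note that thresh counts non-missing values, so a larger thresh is a stricter requirement.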


In [None]:
# remove all the rows that contain a missing value
df1=df.dropna()

The problem with the above operation is that it removes every row that has at least one missing value, which can discard a large part of the dataset. We might have better luck removing all the columns that have at least one missing value.


In [None]:
df1.shape

In [None]:
# check if there is any missing value in df1
print("missing values in new data:", df1.isnull().sum())

In [None]:
# remove all columns with at least one missing value
df2 = df.dropna(axis=1)

In [None]:
df2.shape

In [None]:
# keep only the columns that have at least 885 non-NA entries

df3 = df.dropna(thresh=885, axis=1)
df3.shape

#### Imputing missing values

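**sklearn** provides a *SimpleImputer* class that replaces missing values using a chosen strategy, such as the mean, the median, the most frequent value, or a constant.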

*Syntax*

sklearn.impute.SimpleImputer(missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True, add_indicator=False)

https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
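One property worth illustrating is that the imputer learns its fill value in fit and reuses it in transform, so the same value can be applied to data it has not seen. A minimal sketch with made-up age values (the arrays below are assumptions for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical ages: one missing value in the data we fit on
train_age = np.array([[20.0], [30.0], [np.nan], [40.0]])
new_age = np.array([[np.nan], [50.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(train_age)

# The mean of the observed training values is stored in statistics_
learned_mean = imputer.statistics_[0]

# The stored mean, not the mean of new_age, fills the gap in the new data
filled_new = imputer.transform(new_age)

print(learned_mean)
print(filled_new)
```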


In [None]:
# Fill in missing values for Age with the column mean

from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
imp.fit(df[['Age']])
# transform returns a 2D array of shape (n_rows, 1); flatten it before assigning to the column
df['Age'] = imp.transform(df[['Age']]).ravel()

In [None]:
# Fill in missing values for Age again, this time selecting the column by position
# and storing the result in a new column

from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
Age1 = pd.DataFrame(df.iloc[:, 5].values)
imp.fit(Age1)
Age_filled = imp.transform(Age1)
df['Age_filled'] = Age_filled.ravel()

In [None]:
# Fill in missing values for Cabin with the most frequent value

from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
imp.fit(df[['Cabin']])
df['Cabin'] = imp.transform(df[['Cabin']]).ravel()


In [None]:
# Fill in missing values for Embarked with the most frequent value

from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
imp.fit(df[['Embarked']])
df['Embarked'] = imp.transform(df[['Embarked']]).ravel()

In [None]:
df.columns

In [None]:
# Alternative: delete the columns with missing values instead of imputing them
#df_new = df.drop(["Age", "Cabin", "Embarked"], axis=1)

In [None]:
# Now check if there are any missing values

print(df.isnull().sum())

Pandas provides the **fillna()** method to impute missing values. Pandas also provides options to fill forward (ffill) or fill backward (bfill), which are convenient when working with time series. Other methods are described in the pandas documentation:

https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html
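To make forward and backward filling concrete, here is a small sketch on a made-up daily series (the dates and readings are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with gaps
idx = pd.date_range("2020-01-01", periods=5, freq="D")
readings = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan], index=idx)

# Forward fill: carry the last observed value forward in time
forward = readings.ffill()

# Backward fill: pull the next observed value backward in time;
# the final gap stays NaN because nothing follows it
backward = readings.bfill()

print(forward)
print(backward)
```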

In [None]:
# replace all NA's with 0
df3 = df.fillna(0)
print("shape of new data frame:", df3.shape)

print("missing values in new data frame:", df3.isnull().sum().sum())

In [None]:
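# assign the result back instead of using inplace=True on a selected column,
# which is deprecated in recent pandas versions
df['Age'] = df['Age'].fillna(df['Age'].mean())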

#### Ways to detect and remove the outliers

In [None]:
# Draw box plots for Age and Fare, as they are continuous.
# Plotting both in one figure makes them hard to read because they are on
# different scales, so we plot them separately below.

# df_age_fare = [df['Age'], df['Fare']]
# plt.boxplot(df_age_fare)

In [None]:
plt.boxplot(df['Age'], whis=0.75)
#plt.boxplot(df['Age'], vert=False, whis=0.75)
plt.title('Box plot for Age')

In [None]:
plt.boxplot(df['Fare'], whis=0.75)
plt.title('Box plot for Fare')

In [None]:
# with histogram

plt.hist(df['Age'])
plt.grid()

In [None]:
plt.hist(df['Fare'])
plt.grid()

#### Discover outliers with mathematical functions

##### Z-Score

When calculating the Z-score, we center and rescale the data and look for data points that are too far from zero. Data points that lie far from zero are treated as outliers. In most cases a threshold of 3 is used: if the absolute value of the Z-score is greater than 3 (that is, the Z-score is above 3 or below -3), the data point is identified as an outlier.
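The rule can be sketched with plain NumPy on made-up numbers. The values below are hypothetical; the computation uses the population standard deviation, which is also the default of scipy.stats.zscore.

```python
import numpy as np

# Hypothetical sample: 19 values near 10 and one extreme value at the end
values = np.array([10, 11, 9, 10, 12, 10, 11, 9, 10, 11,
                   10, 9, 11, 10, 12, 9, 10, 11, 10, 100], dtype=float)

# Z-score: subtract the mean, divide by the (population) standard deviation
z = np.abs((values - values.mean()) / values.std())

threshold = 3
outlier_positions = np.where(z > threshold)[0]

print(outlier_positions)
print(values[outlier_positions])
```

With very small samples the extreme value inflates the standard deviation so much that no Z-score can reach 3, which is why the sketch uses 20 points.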

In [None]:
from scipy import stats

df_age_fare = df[['Age', "Fare"]]

df_age_fare = df_age_fare.dropna()

z = np.abs(stats.zscore(df_age_fare))
print(z)

Threshold = 3

print(np.where(z>Threshold))

The first array contains the row positions and the second array the respective column positions. For example, if 23 appears in the first array alongside 1 in the second, then z[23][1] has a Z-score higher than 3, so the data point at row 23 in column Fare is an outlier.


#### Interquartile range (IQR) score

In [None]:
Q1 = df_age_fare.quantile(0.25)
Q3 = df_age_fare.quantile(0.75)
IQR = Q3-Q1
print("IQR:", '\n')
print(IQR)

lwr = Q1-1.5*IQR
upr = Q3+1.5*IQR

# Flag the outliers: True where a value lies outside [lwr, upr]
print((df_age_fare < lwr) | (df_age_fare > upr))


# Get the (row, column) positions of the outliers
x1 = np.where((df_age_fare < lwr) | (df_age_fare > upr))
print(x1)
df_age_fare.iloc[23, 1]
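The IQR rule used in the cell above flags values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. A minimal sketch on a made-up series (the numbers are hypothetical) shows the fences explicitly:

```python
import pandas as pd

# Hypothetical values: nine ordinary points and one extreme point
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Keep only the values outside the fences
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(lower_fence, upper_fence)
print(outliers.tolist())
```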

#### Correcting or removing the outliers

Now that we know how to detect outliers, it is important to decide whether they need to be removed or corrected.

In [None]:
### Z-Score

# keep only the rows where every column has a Z-score below 3
df_age_fare_outlier = df_age_fare[(z < 3).all(axis=1)]

print(df_age_fare.shape)
print(df_age_fare_outlier.shape)

# This removed about 20 rows from the dataset, i.e. the outliers have been removed


### IQR Score

# keep only the rows where no column falls outside the IQR fences
df_age_fare_out_iqr = df_age_fare[~((df_age_fare < (Q1 - 1.5 * IQR)) | (df_age_fare > (Q3 + 1.5 * IQR))).any(axis=1)]

df_age_fare_out_iqr.shape

### Scaling and Normalization

####  Standardization ( Z-Score)

sklearn provides a class called **StandardScaler**.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
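Standardization subtracts the column mean and divides by the column standard deviation, so the result has mean 0 and unit variance. The sketch below checks this on made-up ages (the array is hypothetical) and compares StandardScaler with the manual formula.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical ages as a single-feature column
ages = np.array([[20.0], [30.0], [40.0], [50.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(ages)

# Manual version: StandardScaler uses the population standard deviation (ddof=0)
manual = (ages - ages.mean()) / ages.std()

print(scaled.ravel())
print(scaled.mean(), scaled.std())
```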

In [None]:
from sklearn.preprocessing import StandardScaler
std = StandardScaler()
df['Age']= std.fit_transform(df[['Age']])

In [None]:
df["Age"].iloc[1:10]

#### Minmax

sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), copy=True)
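Min-max scaling maps each value x to (x - min) / (max - min), so with the default feature_range the smallest value becomes 0 and the largest becomes 1. A minimal sketch on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature column
X_toy = np.array([[1.0], [3.0], [5.0]])

min_max_toy = MinMaxScaler(feature_range=(0, 1))
scaled_toy = min_max_toy.fit_transform(X_toy)

# Manual version of the same formula
manual_toy = (X_toy - X_toy.min()) / (X_toy.max() - X_toy.min())

print(scaled_toy.ravel())
```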

In [None]:
##### Min Max normalization

# scikit-learn provides min-max normalization in its preprocessing module.

from sklearn.preprocessing import MinMaxScaler
min_max = MinMaxScaler(feature_range = (0,1))
df['Age']= min_max.fit_transform(df[['Age']])

In [None]:
df["Age"].min()

#### Normalizing Data
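We rescale each observation (row) to a length of 1 (a unit norm). For this, we use the Normalizer class. Because the norm is computed across the features of each row, this is meaningful only when several columns are normalized together. Let's take an example.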
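Unlike the scalers, Normalizer works on rows rather than columns. The sketch below uses made-up numbers to show the effect on a two-feature array, and what happens when only a single column is passed.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical observations with two features each
rows = np.array([[3.0, 4.0], [1.0, 0.0]])

normalizer = Normalizer()  # default norm is L2
unit_rows = normalizer.fit_transform(rows)

# Each row now has Euclidean length 1
row_lengths = np.linalg.norm(unit_rows, axis=1)

# With a single feature, every non-zero value collapses to 1 or -1
single_column = normalizer.fit_transform(np.array([[3.0], [4.0]]))

print(unit_rows)
print(single_column.ravel())
```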

In [None]:
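from sklearn.preprocessing import Normalizer
normal_scaler = Normalizer()
# Normalizer works row-wise, so pass several features together; applied to a
# single column, every non-zero value would simply become 1 or -1
age_fare_normalized = normal_scaler.fit_transform(df[['Age', 'Fare']])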

In [None]:
df.dtypes

#### Encoding categorical data
Sometimes we have categorical variables in our data. Since models are based on mathematical equations and calculations, we have to encode the categorical data as numbers. Sklearn provides an efficient tool for encoding the levels of a categorical feature into numeric values: a class called LabelEncoder, which we will use.


In [None]:
df.columns

In [None]:
# creating initial dataframe
bridge_types = ('Arch','Beam','Truss','Cantilever','Tied Arch','Suspension','Cable')
bridge_df = pd.DataFrame(bridge_types, columns=['Bridge_Types'])
# creating instance of labelencoder
from sklearn import preprocessing
labelencoder = preprocessing.LabelEncoder()
# Assigning numerical values and storing in another column
bridge_df['Bridge_Types_Cat'] = labelencoder.fit_transform(bridge_df['Bridge_Types'])
bridge_df


In [None]:
from sklearn.preprocessing import LabelEncoder
labelencoder_X = LabelEncoder()

df['Embarked']= labelencoder_X.fit_transform(df['Embarked'])
df['Embarked']

#### One-Hot Encoder

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html


sklearn.preprocessing.OneHotEncoder(categories='auto', drop=None, sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error')
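The handle_unknown parameter controls what happens when transform meets a category that was not present during fit. A small sketch with made-up port codes (the code "X" is a hypothetical unseen category):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical embarkation codes used for fitting
ports = np.array([["S"], ["C"], ["Q"], ["S"]])

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(ports)

# Categories are stored in sorted order
categories = encoder.categories_[0].tolist()

# "X" was never seen during fit; with handle_unknown="ignore" it becomes all zeros
encoded = encoder.transform(np.array([["S"], ["X"]])).toarray()

print(categories)
print(encoded)
```

With the default handle_unknown='error', the same transform call would raise an error instead.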

In [None]:
from sklearn.preprocessing import OneHotEncoder

# creating instance of one-hot-encoder
onehot_enc = OneHotEncoder(handle_unknown='ignore')


onehot_Embarked = pd.DataFrame(onehot_enc.fit_transform(df[['Embarked']]).toarray())
# append the one-hot columns to the original data frame
df = pd.concat([df, onehot_Embarked], axis=1).reindex(df.index)

In [None]:
from sklearn.preprocessing import OneHotEncoder

# creating instance of one-hot-encoder
onehot_enc = OneHotEncoder(handle_unknown='ignore')


onehot_Sex = pd.DataFrame(onehot_enc.fit_transform(df[['Sex']]).toarray(),columns = ["F","M"])
df = pd.concat([df, onehot_Sex], axis=1).reindex(df.index)


#### Using the dummy variables approach
This approach is more flexible because it allows you to encode as many categorical columns as you like and to choose how to label the resulting columns using a prefix. Proper naming will make the rest of the analysis just a little bit easier.
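As a sketch of the prefix option, the toy frame below (hypothetical rows, with an assumed prefix "Port" for the Embarked column) encodes two columns in one call and leaves the numeric column untouched.

```python
import pandas as pd

# Hypothetical rows resembling the Titanic columns
toy = pd.DataFrame({
    "Fare": [7.25, 71.28, 8.05],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# One prefix per encoded column; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(
    toy,
    columns=["Sex", "Embarked"],
    prefix=["Sex", "Port"],
    dtype=int,
)

print(list(encoded.columns))
print(encoded)
```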

In [None]:
df.head()

In [None]:
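# get_dummies can also encode several columns at once, with one prefix per column:
#dummies = pd.get_dummies(df, columns=["Sex", "Embarked"], prefix=["Sex", "Embarked"])

dummy_Sex = pd.get_dummies(df[["Sex"]])

# Embarked was label-encoded to integers above; passing it as a Series makes
# get_dummies encode it regardless of dtype
dummy_Embarked = pd.get_dummies(df["Embarked"], prefix="Embarked")
df = pd.concat([df, dummy_Sex, dummy_Embarked], axis=1).reindex(df.index)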

#### Discretization

Discretization, also known as quantization or binning, divides a continuous feature into a pre-specified number of categories (bins), and thus makes the data discrete.

One of the main goals of discretization is to significantly reduce the number of distinct values of a continuous attribute. This is why the transformation can improve the performance of tree-based models.

*Sklearn* provides a **KBinsDiscretizer** class that takes care of this. The only things you have to specify are the number of bins (n_bins) for each feature and how to encode these bins (ordinal, onehot or onehot-dense). The optional strategy parameter can be set to one of three values:

- *uniform*, where all bins in each feature have identical widths.
- *quantile* (default), where all bins in each feature have the same number of points.
- *kmeans*, where all values in each bin have the same nearest center of a 1D k-means cluster.

*Syntax*

sklearn.preprocessing.KBinsDiscretizer(n_bins=5, encode='onehot', strategy='quantile')

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html#sklearn.preprocessing.KBinsDiscretizer
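The difference between the uniform and quantile strategies is easiest to see on skewed data. The sketch below uses a made-up column with one large value; depending on the installed scikit-learn version, the quantile strategy may print a warning about changing defaults.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical skewed feature: five small values and one large value
skewed = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [100.0]])

# uniform: two bins of equal width, [0, 50) and [50, 100]
uniform_bins = KBinsDiscretizer(
    n_bins=2, encode="ordinal", strategy="uniform"
).fit_transform(skewed)

# quantile: two bins with the same number of points, split at the median
quantile_bins = KBinsDiscretizer(
    n_bins=2, encode="ordinal", strategy="quantile"
).fit_transform(skewed)

print(uniform_bins.ravel())
print(quantile_bins.ravel())
```

With equal-width bins almost all points fall into the first bin, while quantile bins split the points evenly.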

In [None]:
from sklearn.preprocessing import KBinsDiscretizer
#est = KBinsDiscretizer(n_bins=5, encode='onehot-dense', strategy='uniform')
est = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')

#df['Fare'] = est.fit_transform(df[['Fare']])
Fare_disc = est.fit_transform(df[['Fare']])

#### Custom Transformation
If you want to convert an existing function into a transformer to assist in data cleaning or processing, you can implement a transformer from an arbitrary function with **FunctionTransformer**.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html

*Syntax*

sklearn.preprocessing.FunctionTransformer(func=None, inverse_func=None, validate=False, accept_sparse=False, check_inverse=True, kw_args=None, inv_kw_args=None)
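The inverse_func parameter in the signature above lets the transformer undo its own transformation. A minimal sketch, assuming np.log1p and np.expm1 as the function pair and made-up fare values:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# log1p handles zeros safely; expm1 is its exact inverse
log_transformer = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)

# Hypothetical fares, including a zero
fares = np.array([[0.0], [7.25], [71.28]])

logged = log_transformer.fit_transform(fares)
restored = log_transformer.inverse_transform(logged)

print(logged.ravel())
print(restored.ravel())
```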


In [None]:
from sklearn.preprocessing import FunctionTransformer
transformer_log = FunctionTransformer(np.log)
X = np.array([[10, 20], [30, 40]])
print(X,'\n')
print("log transformation:")
print(transformer_log.transform(X))


# Using the pandas apply function gives the same result.

print(pd.DataFrame(X).apply(np.log))



#### References
https://towardsdatascience.com/preprocessing-with-sklearn-a-complete-and-comprehensive-guide-670cb98fcfb9

https://www.simplilearn.com/data-preprocessing-tutorial

https://www.deeplearning-academy.com/p/ai-wiki-data-preprocessing

https://towardsdatascience.com/data-preprocessing-concepts-fa946d11c825

https://data-flair.training/blogs/python-ml-data-preprocessing/

https://haridas.in/outlier-removal-clustering.html (outlier removal using clustering)

https://www.pluralsight.com/guides/cleaning-up-data-from-outliers