## Introduction to Machine Learning: Basic Concepts

### Supervised Learning


<img src="https://miro.medium.com/max/2494/1*nJCYz0UIJNtRF7Et2Pi_aQ.png" width="600">


### Unsupervised Learning

<img src="https://static.packt-cdn.com/products/9781788393485/graphics/f8315f3f-703d-4929-bd7b-cd40db553fc5.png" width="600">

### Reinforcement Learning

<img src="https://miro.medium.com/max/3084/0*WC4l7u90TsKs_eXj.png" width="600">


## Main Components

* Data
* Loss Function
* Algorithm

### Data

Data should be in tabular form, where each row (datapoint) is one atomic observation in the set and each column is a feature. Most ML algorithms can only process numeric data, so text, images and categories must first be encoded as numbers.
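As a small illustration of this layout, the sketch below builds a toy table with pandas and turns a text column into numbers. The column names, values and the `class_codes` mapping are made up for the example and do not come from the dataset used later.

```python
import pandas as pd

# Hypothetical toy table: one row per passenger (the "atomic" datapoint)
passengers = pd.DataFrame({
    "age": [23, 41, 35],
    "flight_distance": [275, 1200, 640],
    "travel_class": ["Eco", "Business", "Eco"],
})

# Most algorithms need numbers, so the text column is mapped to integer codes
class_codes = {"Eco": 0, "Business": 1}  # assumed mapping, for illustration only
passengers["travel_class"] = passengers["travel_class"].map(class_codes)

# The fully numeric table can now be handed to an algorithm as a matrix
X = passengers.to_numpy()
print(X.shape)
print(passengers.dtypes)
```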

**Time series**
![](https://i.ibb.co/jLKD1c8/timeseries.png)
&nbsp;
&nbsp;
&nbsp;
![](https://i.ibb.co/RPhdN96/timeseries2.png)



**Numeric data**
![](https://miro.medium.com/max/1048/1*k7ZifUD4IFuiN06TVnVIVw.png)



**Images**

![](https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2019/02/Plot-of-a-Subset-of-Images-from-the-MNIST-Dataset.png)

![](https://jamesmccaffrey.files.wordpress.com/2014/06/mnist_viewer_demo.jpg)




**Text**
![](https://cdn.analyticsvidhya.com/wp-content/uploads/2020/02/BoWBag-of-Words-model-2.png)


### Algorithm

* Linear 
* Tree Based 
* Ensemble Models
* Neural Networks

### Linear Models

![](https://i0.wp.com/cmdlinetips.com/wp-content/uploads/2020/03/Linear_Regression_Beta_Hat_Matrix_Multiplication.png?resize=561%2C136&ssl=1)


&nbsp;
&nbsp;
&nbsp;

or more simply:

&nbsp;
&nbsp;
&nbsp;


<img src="https://static.packt-cdn.com/products/9781789537123/graphics/78c4af48-3b33-4cbd-bc15-45aeb0f8833e.png" width="200">

&nbsp;
&nbsp;
&nbsp;
&nbsp;
&nbsp;

We aim to find the values of the parameters a and b that make the function fit our data best.

<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcSLOy-ZBkyQ11sPq0URHUrWwJvhpZNGNsWDA1Un19rmCyTnaE6V">
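To make the idea of fitting a and b concrete, here is a minimal sketch that uses the closed-form least-squares solution for a straight line. The data points are invented and roughly follow y = 2x + 1; this is one common way to fit a line, not necessarily the one pictured above.

```python
import numpy as np

# Made-up points that roughly follow y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Closed-form least squares for a line y = a*x + b
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()

# Predictions of the fitted line at the observed x values
y_hat = a * x + b
print(f"a = {a:.3f}, b = {b:.3f}")
```

The same values can be cross-checked with `np.polyfit(x, y, 1)`, which returns the slope and intercept of the least-squares line.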

&nbsp;
&nbsp;
&nbsp;

### Tree Based Algorithms
![](https://cdn-images-1.medium.com/max/824/0*J2l5dvJ2jqRwGDfG.png)

### Ensemble Models
Ensemble models combine the predictions of multiple models, for example by majority vote or by averaging their outputs.
![](https://miro.medium.com/max/600/1*sSSHJeUE2WHp3xD35NoJ9w.png)
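A minimal sketch of the voting idea, assuming three already-trained classifiers whose predictions are simply written out by hand here. The `majority_vote` helper is hypothetical and only illustrates hard voting.

```python
from collections import Counter

# Hypothetical class predictions from three models for five samples
predictions = [
    [1, 0, 1, 1, 0],  # model A
    [1, 1, 1, 0, 0],  # model B
    [0, 0, 1, 1, 0],  # model C
]

def majority_vote(per_model_predictions):
    """Return the most common label per sample across all models."""
    voted = []
    for sample_votes in zip(*per_model_predictions):
        label, _count = Counter(sample_votes).most_common(1)[0]
        voted.append(label)
    return voted

ensemble = majority_vote(predictions)
print(ensemble)
```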

### Neural Networks
Single Neuron
![](https://miro.medium.com/max/1826/1*L9xLcwKhuZ2cuS8fF0ZjwA.png)

More Complex Architectures
![](https://www.asimovinstitute.org/wp-content/uploads/2019/04/NeuralNetworkZoo20042019.png)


### Loss Function

![](https://miro.medium.com/max/500/0*gglavDlTUWKn4Loe)

For tree-based methods, the Gini index or entropy is calculated to measure the quality of a split.

For K-NN, the distances between data points are calculated.
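The quantities mentioned above can be sketched in a few lines of plain Python. The helper names (`class_proportions`, `gini`, `entropy`) and the example labels are assumptions made for illustration.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Share of each class among the labels."""
    counts = Counter(labels)
    total = len(labels)
    return [c / total for c in counts.values()]

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2)."""
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))

pure = ["satisfied"] * 4
mixed = ["satisfied", "satisfied", "dissatisfied", "dissatisfied"]
print(gini(pure), entropy(pure))    # a pure node has no impurity
print(gini(mixed), entropy(mixed))  # a 50/50 node is maximally impure for two classes

# Euclidean distance between two points, as used by K-NN
distance = math.dist([0.0, 0.0], [3.0, 4.0])
print(distance)
```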

#### Gradient Descent
While searching for the optimal function parameters, we need to minimize our errors $e_i$. Solving for the minimum directly can be computationally expensive, so we use Gradient Descent.

Gradient Descent is an iterative method that finds a local minimum of the cost function, which is defined by the function parameters and the errors.

![](https://pbs.twimg.com/media/EK2QGHyW4AAooxn?format=png&name=small)
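A minimal sketch of gradient descent for the line y = a*x + b with a mean squared error cost. The toy data, the learning rate and the number of steps are assumptions picked so the example converges quickly.

```python
import numpy as np

# Noiseless toy data generated from y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

a, b = 0.0, 0.0          # initial guess for the parameters
learning_rate = 0.05     # assumed step size

for step in range(2000):
    error = (a * x + b) - y            # e_i for every datapoint
    grad_a = 2.0 * np.mean(error * x)  # d(MSE)/da
    grad_b = 2.0 * np.mean(error)      # d(MSE)/db
    a -= learning_rate * grad_a        # step against the gradient
    b -= learning_rate * grad_b

print(a, b)
```

Each step moves the parameters a little in the direction that reduces the cost, so a and b approach the values used to generate the data.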

#### Activation Functions & Back Propagation
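Neural networks apply an activation function to each neuron's weighted input to produce the output it passes on to the next layer, much like a signal travelling along an axon.

In backpropagation, the loss measured at the output neuron is propagated backwards through the network so that the weights on the preceding connections can be adjusted.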

![](https://missinglink.ai/wp-content/uploads/2018/11/activationfunction-1.png)
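The sketch below shows two common activation functions and a single backpropagation step for one sigmoid neuron with a squared-error loss. All numbers (inputs, weights, target, learning rate) are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

# One neuron with two inputs: output = sigmoid(w . x + b)
x = [1.0, 2.0]
w = [0.5, -0.25]
b = 0.1
target = 1.0
learning_rate = 0.5

def forward(weights, bias):
    z = sum(wi * xi for wi, xi in zip(weights, x)) + bias
    return sigmoid(z)

out = forward(w, b)
loss_before = (out - target) ** 2

# Chain rule: dL/dw_i = dL/dout * dout/dz * dz/dw_i
d_loss_d_out = 2.0 * (out - target)
d_out_d_z = out * (1.0 - out)          # derivative of the sigmoid
grad_w = [d_loss_d_out * d_out_d_z * xi for xi in x]
grad_b = d_loss_d_out * d_out_d_z

# Adjust the weights against the gradient
w = [wi - learning_rate * gi for wi, gi in zip(w, grad_w)]
b = b - learning_rate * grad_b

loss_after = (forward(w, b) - target) ** 2
print(loss_before, loss_after)
```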



### Now a Sample Application
Let's put what we have learnt to use.

The task is to predict passenger satisfaction with a flight.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import lightgbm as lgb
import xgboost as xgb
import catboost as cb
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
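from sklearn.metrics import confusion_matrix, roc_auc_score, accuracy_score, classification_report
try:
    from sklearn.metrics import plot_confusion_matrix
except ImportError:  # plot_confusion_matrix was removed in scikit-learn 1.2
    from sklearn.metrics import ConfusionMatrixDisplay
    plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator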
from scipy import stats

import warnings
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
directory = "/kaggle/input/airline-passenger-satisfaction/"
feature_tables = ['train.csv', 'test.csv']

df_train = directory + feature_tables[0]
df_test = directory + feature_tables[1]

# Create dataframes
print(f'Reading csv from {df_train}...')
train = pd.read_csv(df_train)
print('...Complete')

print(f'Reading csv from {df_test}...')
test = pd.read_csv(df_test)
print('...Complete')

Let's add some anomalies

In [None]:
train.drop(['Unnamed: 0'], axis=1, inplace=True)

In [None]:
df_bad = pd.DataFrame([[129881,np.nan,"Loyal Customer" ,23,"Personal Travel","Eco Plus",275,1,1,1,3,3,1,5,5,1,4,3,2,1,2,2,5,"satisfied"]
,[129882,"Male","Loyal Customer" ,23,"Personal Travel","Eco Plus",270,1,3,1,4,5,5,2,4,2,5,4,1,4,1,2,1,"satisfied"]
,[129883,"Female","Loyal Customer" ,27,"Personal Travel",np.nan,281,3,2,2,3,3,1,4,1,4,4,2,2,2,1,2,3,"satisfied"]
,[129884,"Male","Loyal Customer" ,66,"Personal Travel","Eco Plus",125,5,4,3,2,4,1,4,5,2,1,1,3,2,2,2,4,"neutral or dissatisfied"]
,[129885,"Female","Loyal Customer" ,14,"Personal Travel","Eco Plus",151,4,2,2,2,4,np.nan,1,3,4,2,4,np.nan,4,3,1,3,"satisfied"]
,[129886,"Mal",np.nan,23,"Personal Travel","Eco Plus",177,5,5,3,1,3,4,2,4,3,1,1,5,1,2,1,2,"satisfied"]
,[129887,np.nan,"Loyal Customer" ,200,"Personal Travel","Eco Plus",216,3,2,1,3,1,5,1,3,5,4,5,3,2,1,2,4,"satisfied"]
,[129888,"Male","Loyal Customer" ,28,"Personal Travel",np.nan,162,5,4,1,4,4,1,4,1,1,3,2,5,1,5,1,2,"neutral or dissatisfied"]
,[129889,"Female","Loyal Customer" ,34,np.nan,"Eco Plus",159,4,5,1,4,5,2,4,5,np.nan,2,1,2,3,5,4,1,"satisfied"]
,[129890,"Male","Loyal Customer" ,51,"Personal Travel","Eco Plus",242,3,3,5,4,3,4,3,4,2,1,2,2,3,3,2,5,"satisfied"]
,[129891,"Female","Loyal Customer" ,44,"Personal Travel","Eco Plus",207,2,5,2,5,4,1,4,5,5,5,1,2,4,3,4,1,"neutral or dissatisfied"]
,[129892,"Male",np.nan,1,"Personal Travel","Eco Plus",-220,1,4,5,np.nan,3,5,5,5,5,5,4,3,2,4,4,2,"satisfied"]
,[129893,np.nan,"Loyal Customer" ,14,np.nan,"Eco Plus",113,1,5,4,3,3,5,5,2,3,3,2,1,3,3,3,3,"neutral or dissatisfied"]
,[129894,"Male","Loyal Customer" ,61,"Personal Travel","Eco Plus",184,3,4,2,5,2,3,2,1,5,3,1,5,2,1,3,1,"satisfied"]
,[129895,"Female","Loyal Customer" ,48,"Personal Travel","Eco Plus",155,5,4,4,1,5,3,5,4,2,2,5,3,2,np.nan,1,3,"satisfied"]
,[129896,"Male","Loyal Customer" ,199,"Personal Travel","Eco Plus",171,5,2,3,2,5,3,4,3,3,1,4,1,2,5,3,4,"neutral or dissatisfied"]
])
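
df_bad.columns = train.columns
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported replacement.
# ignore_index=True gives the combined frame a fresh, unique index.
train = pd.concat([train, df_bad], ignore_index=True)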

### Exploratory Data Analysis: Understanding the data

In [None]:
train.head()

In [None]:
train.describe()

Let's check if we have null values

In [None]:
train.isnull().any()

In [None]:
for col in train.columns:
    if train[col].isnull().any():
        print(col, train[col].dtype)

In [None]:
train[train['Customer Type'].isna()]

In [None]:
train['Customer Type'].value_counts()

In [None]:
train['Customer Type'] = train['Customer Type'].fillna("Loyaly Customer")
train['Customer Type'].isnull().any()

In [None]:
train['Type of Travel'] = train['Type of Travel'].fillna("Business travel")

train['Gender'] = train['Gender'].fillna("Female")

train['Class'] = train['Class'].fillna("Business")

train['Gate location'] = train['Gate location'].fillna(train['Gate location'].median())

train['Online boarding'] = train['Online boarding'].fillna(train['Online boarding'].median())

train['On-board service'] = train['On-board service'].fillna(train['On-board service'].median())

train['Checkin service'] = train['Checkin service'].fillna(train['Checkin service'].median())

train['Cleanliness'] = train['Cleanliness'].fillna(train['Cleanliness'].median())
# Still to handle: Arrival Delay in Minutes (float64)

In [None]:
# sns.distplot is deprecated; histplot is its histogram replacement
sns.histplot(np.log1p(train['Arrival Delay in Minutes']), bins=5)


In [None]:
#Should I fill with 0 or median value?

#Vote in: https://www.strawpoll.me/33402615

#train['Arrival Delay in Minutes'] = train['Arrival Delay in Minutes'].fillna(train['Arrival Delay in Minutes'].median())
train['Arrival Delay in Minutes'] = train['Arrival Delay in Minutes'].fillna(0)


In [None]:
train.isnull().any()

### Data Characteristics

Let's check our data distribution

In [None]:
train.describe()

In [None]:
sns.boxplot(x=train['Age'])

In [None]:


sns.scatterplot(data=train, x="id", y="Flight Distance")

#which ones are the outliers?
#https://www.strawpoll.me/33402725

In [None]:
train = train[train['Age']<=120]

train = train[train['Flight Distance']>0]
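The filters above rely on domain knowledge (nobody is older than 120, and a distance cannot be negative). When no such rule is available, a common alternative is the interquartile range (IQR) rule. The sketch below applies it to a made-up list of ages; the 1.5 multiplier is the usual convention, not something taken from this dataset.

```python
import numpy as np

# Made-up ages, including two implausible values
ages = np.array([23, 27, 66, 14, 200, 28, 34, 51, 44, 1, 61, 48, 199])

# Quartiles and interquartile range
q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1

# Points further than 1.5 * IQR from the quartiles are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]
print(outliers)
```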

In [None]:
fig = plt.figure(figsize=(15, 12))

subset = train[["id", "Age","Flight Distance", "Inflight wifi service","Departure/Arrival time convenient","Ease of Online booking","Gate location","Food and drink","Online boarding","Seat comfort","Inflight entertainment","On-board service","Leg room service","Baggage handling","Checkin service","Inflight service","Cleanliness","Departure Delay in Minutes","Arrival Delay in Minutes",]]
# Start at 1 to skip the "id" column
for i in range(1, subset.shape[1]):
    plt.subplot(7, 7, i)
    f = plt.gca()
    f.axes.get_yaxis().set_visible(False)
    # f.axes.set_ylim([0, train.shape[0]])

    # One bin per distinct value for low-cardinality columns, otherwise 10 bins
    vals = np.size(subset.iloc[:, i].unique())
    bins = vals if vals < 10 else 10

    plt.hist(subset.iloc[:, i], bins=bins)

plt.tight_layout()


In [None]:
train.satisfaction.value_counts()

We have some non-numerical columns. How do we process them?
https://www.strawpoll.me/33402755

In [None]:
str_cols = ["Gender", "Customer Type", "Type of Travel", "Class", "satisfaction"]


for col in str_cols:
    print(col)
    print(train[col].value_counts())
    print("+---------------------------------+")


In [None]:
train['Gender'] = np.where(train['Gender'] == 'Mal', 'Male', train['Gender'])
train['Customer Type'] = np.where(train['Customer Type'] == 'Loyaly Customer', 'Loyal Customer', train['Customer Type'])

In [None]:
cat_cols = ["Gender", "Customer Type", "Type of Travel", "Class"]



In [None]:
pd.get_dummies(train['Gender'], columns=["Gender"])

In [None]:
train[cat_cols].head()

In [None]:
df_cat = pd.get_dummies(train[cat_cols], columns=cat_cols)
df_cat.head()

In [None]:
train.info()

In [None]:
num_cols = ["Age","Flight Distance","Inflight wifi service","Departure/Arrival time convenient","Ease of Online booking","Gate location","Food and drink","Online boarding","Seat comfort","Inflight entertainment","On-board service","Leg room service","Baggage handling","Checkin service","Inflight service","Cleanliness","Departure Delay in Minutes","Arrival Delay in Minutes", "satisfaction"]
df_num = train[num_cols].copy()  # copy so the assignment below does not write into a view of train
df_num["satisfaction"] = np.where(df_num["satisfaction"] == "satisfied", 1, 0)
df_num.head()

In [None]:
df = pd.concat([df_num, df_cat], axis=1)

What about passenger grades? Are they numerical or categorical?
https://www.strawpoll.me/33402823

Features' relation with our target

In [None]:
df.columns

In [None]:
df = df[['Age','Flight Distance','Inflight wifi service','Departure/Arrival time convenient','Ease of Online booking','Gate location','Food and drink','Online boarding','Seat comfort','Inflight entertainment','On-board service','Leg room service','Baggage handling','Checkin service','Inflight service','Cleanliness','Departure Delay in Minutes','Arrival Delay in Minutes','Gender_Female','Gender_Male','Customer Type_Loyal Customer','Customer Type_disloyal Customer','Type of Travel_Business travel','Type of Travel_Personal Travel','Class_Business','Class_Eco','Class_Eco Plus','satisfaction']]
df.head()

In [None]:
fig_dims = (80, 100)
fig, ax = plt.subplots(figsize=fig_dims)
sns.set(font_scale=2)
sns.heatmap(df.corr(), annot=True, ax = ax)



Gate location and Departure/Arrival time convenient seem to be irrelevant.

Food and drink seems to be collinear with Cleanliness, Inflight entertainment and Seat comfort.

The categorical dummy variables are highly collinear with each other. Shouldn't they be? Each group of dummies is derived from a single column, so knowing one dummy largely determines the others.
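A quick check of why dummy columns from the same source column are collinear, using a made-up `Gender` series. With two categories the two dummies are exact complements, so their correlation is -1. The `drop_first=True` option of `pd.get_dummies` is one common remedy and is shown here only as an option, not as what this notebook does.

```python
import pandas as pd

# Made-up two-category column
gender = pd.Series(["Male", "Female", "Female", "Male", "Female"], name="Gender")

# Full one-hot encoding: one column per category
dummies = pd.get_dummies(gender, prefix="Gender", dtype=int)
corr = dummies["Gender_Female"].corr(dummies["Gender_Male"])
print(corr)

# Dropping the first category removes the redundant column
reduced = pd.get_dummies(gender, prefix="Gender", drop_first=True, dtype=int)
print(list(reduced.columns))
```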

In [None]:
df[['Gate location', 'satisfaction']]


sns.jointplot(data=df[['Gate location', 'satisfaction']], x="Gate location", y="satisfaction")


In [None]:

sns.jointplot(data=df[['Departure/Arrival time convenient', 'satisfaction']], x="Departure/Arrival time convenient", y="satisfaction")


### Building the Model
Now that our data is in shape, it is time to build our model and investigate the results.

In [None]:
df.columns

In [None]:
# Note: 'Departure/Arrival time convenient' looked irrelevant above but is still in the list below; remove it here to test its effect

#Define our features and target (this is helpful in case you would like to drop any features that harm model performance)
features = ['Age','Flight Distance','Inflight wifi service','Departure/Arrival time convenient','Ease of Online booking','Gate location','Food and drink','Online boarding','Seat comfort','Inflight entertainment','On-board service','Leg room service','Baggage handling','Checkin service','Inflight service','Cleanliness','Departure Delay in Minutes','Arrival Delay in Minutes','Gender_Female','Gender_Male','Customer Type_Loyal Customer','Customer Type_disloyal Customer','Type of Travel_Business travel','Type of Travel_Personal Travel','Class_Business','Class_Eco','Class_Eco Plus']
target = ['satisfaction']

from sklearn.model_selection import train_test_split

X = df[features]
y = df[target].to_numpy().ravel()  # 1-D target array, as scikit-learn expects

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)


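# Standardize features: fit the scaler on the training set only,
# then apply the same transformation to the test set to avoid leakage
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)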

## Model Evaluation

Now let's create our model

#### How do we measure our models' performance?
Precision & Recall
![](https://cdn-images-1.medium.com/fit/t/1600/480/1*Ub0nZTXYT8MxLzrz0P7jPA.png)

F1-score
![](https://miro.medium.com/max/1624/0*DaA9fG5env3JGp2k)
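The three metrics can be computed by hand from the counts of true positives, false positives and false negatives. The labels below are invented for the example.

```python
# Made-up true labels and predictions (1 = satisfied, 0 = not satisfied)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)  # how many predicted positives are correct
recall = tp / (tp + fn)     # how many actual positives are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, f1)
```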


In [None]:
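def run_model(model, X_train, y_train, X_test, y_test, verbose=True):
    """Fit the model, print ROC AUC and a classification report, and plot the confusion matrix."""
    if not verbose:
        # Only valid for estimators whose fit() accepts a verbose argument
        model.fit(X_train, y_train, verbose=0)
    else:
        model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Note: the AUC is computed on hard class predictions here, not on predicted probabilities
    roc_auc = roc_auc_score(y_test, y_pred.round())
    print("ROC_AUC = {}".format(roc_auc))
    print(classification_report(y_test, y_pred.round(), digits=5))
    plot_confusion_matrix(model, X_test, y_test.round(), cmap=plt.cm.Blues, normalize='all')

    return model, roc_auc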

### Logistic Regression



In [None]:
from sklearn.linear_model import LogisticRegression



model_lr = LogisticRegression()
model_lr, roc_auc_lr = run_model(model_lr, X_train, y_train, X_test, y_test)

#### What is the value of this approach?

In [None]:
# Print the fitted linear combination: one coefficient per (standardized) feature, then the intercept
for name, coef in zip(X.columns, model_lr.coef_[0]):
    print(name, 'x', coef, '+')

print(model_lr.intercept_[0])

In [None]:
model_lr.coef_

### Random Forest 

A great place to start! We are seeing an accuracy of 96.30%. Let's see how the other models compare.

In [None]:
params_rf = {'max_depth': 25,
         'min_samples_leaf': 1,
         'min_samples_split': 2,
         'n_estimators': 200,
         'random_state': 42}

model_rf = RandomForestClassifier(**params_rf)
model_rf, roc_auc_rf = run_model(model_rf, X_train, y_train, X_test, y_test)

What could go wrong?
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/19/Overfitting.svg/1200px-Overfitting.svg.png" width="600">
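One way to spot overfitting is to compare the score on the training data with the score on held-out data. The sketch below does this for a shallow and an unrestricted decision tree on synthetic, noisy data. The dataset from `make_classification` and all parameter values are assumptions for illustration only; a large gap between the two scores typically indicates that the model has memorized the training set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise (flip_y) so that memorizing is harmful
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

results = {}
for depth in (3, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    results[depth] = {"train": tree.score(X_tr, y_tr),
                      "test": tree.score(X_te, y_te)}
    print(depth, results[depth])
```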

### XGBoost

Notorious XGBoost

In [None]:
params_xgb ={}

model_xgb = xgb.XGBClassifier(**params_xgb)
model_xgb, roc_auc_xgb = run_model(model_xgb, X_train, y_train, X_test, y_test)

In [None]:
auc_scores = [roc_auc_lr, roc_auc_rf, roc_auc_xgb]
model_scores = pd.DataFrame(auc_scores, index=['Logistic Regression','Random Forest','XGBoost'], columns=['AUC'])
model_scores.head()