<center><h1> GUIDE </h1></center>

### One-Hot Encoding and dropping the first feature

In [None]:
# One-hot encode a categorical column; drop_first=True drops the first
# category, which is implied when all remaining dummy columns are 0

def enc(feature):
    dummy = pd.get_dummies(df_train[[feature]],drop_first=True)
    return dummy

airline_dummy = enc('Airline')
source_dummy = enc('Source')
destination_dummy = enc('Destination')
total_stops_dummy = enc('Total_Stops')

# Concat encoded columns to original dataframe

df_train_new = pd.concat([df_train,airline_dummy,source_dummy,destination_dummy,total_stops_dummy],axis=1)
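
To see what `drop_first=True` does, here is a minimal sketch on a made-up three-row frame. The `toy` DataFrame and its airline values are illustrative assumptions, not rows taken from `df_train`.

```python
import pandas as pd

# Hypothetical toy data, not taken from df_train
toy = pd.DataFrame({'Airline': ['IndiGo', 'Air India', 'SpiceJet']})

full = pd.get_dummies(toy[['Airline']])                      # one column per category
reduced = pd.get_dummies(toy[['Airline']], drop_first=True)  # first category dropped

print(list(full.columns))
print(list(reduced.columns))
```

The dropped category ('Air India' here) is still recoverable: it is the row where all remaining dummy columns are 0.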

### CatPlot - Categorical vs continuous plotting to check outliers and value counts

In [None]:
sns.catplot(y='Price',x='Airline',data=df_train.sort_values('Price',ascending=False),kind='boxen',height=6,aspect=3)
plt.show()

### Heatmap - plotting heatmap for correlation matrix using matplotlib and seaborn

In [None]:
plt.figure(figsize=(10,12))
sns.heatmap(df_train1.corr(), annot=True, cmap='RdYlGn')

### Multicollinearity - Variance Inflation Factor (VIF) for multicollinearity

In [None]:
# Multicollinearity VIF

from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif(X):
    # Build a table with one VIF value per column of X
    vif_df = pd.DataFrame()
    vif_df['variables'] = X.columns
    vif_df['vif'] = [variance_inflation_factor(X.values,i) for i in range(X.shape[1])]
    return vif_df
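
As a rough illustration of what the VIF measures, the sketch below computes it by hand with NumPy as 1 / (1 - R²), where R² comes from regressing each column on all the others. The function name `vif_manual` and the synthetic columns are assumptions for this example. It fits an intercept in each regression, so to my understanding its values can differ from statsmodels' `variance_inflation_factor` unless a constant column is added to `X` first.

```python
import numpy as np

def vif_manual(X):
    """Return one VIF per column of a 2-D array X (intercept fitted in each regression)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for i in range(k):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(n), others])      # add intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares fit
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)                  # independent of x1
x3 = x1 + 0.01 * rng.normal(size=200)      # nearly a copy of x1
vifs = vif_manual(np.column_stack([x1, x2, x3]))
print(vifs)
```

The nearly duplicated columns (`x1`, `x3`) get very large VIF values, while the independent column stays close to 1.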

### Metrics from Sklearn

In [None]:
from sklearn import metrics

print('MSE',metrics.mean_squared_error(y_test,y_pred))
print('R2',metrics.r2_score(y_test,y_pred))
print('Adjusted R2', 1 - ((1 - metrics.r2_score(y_test,y_pred)) * (len(X_test) - 1) / (len(X_test) - X_test.shape[1] - 1)))
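
The adjusted R2 expression in the last line can be wrapped in a small helper. This is only a sketch: the name `adjusted_r2` and the example numbers are assumptions, with `n` the number of test rows and `p` the number of features.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    if n - p - 1 <= 0:
        raise ValueError("need more rows than features + 1")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Example: R2 of 0.90 on 101 test rows with 10 features
adj = adjusted_r2(0.90, n=101, p=10)
print(round(adj, 4))  # 0.8889
```

The penalty grows with the number of features, so adjusted R2 never exceeds R2.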

### Cross Validation(CV) - K-fold RandomizedSearchCV

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
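# Number of features to consider at every split
# ('auto' was removed in scikit-learn 1.3; 1.0 uses all features, which is what 'auto' meant for regressors)
max_features = [1.0,'sqrt']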
# Maximum number of levels in a tree
max_depth = [int(x) for x in np.linspace(start=10, stop=110, num=11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2,5,10]
# Minimum number of samples required at leaf node
min_samples_leaf = [1,2,4]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create Random Grid
random_grid = {'n_estimators':n_estimators,
              'max_features':max_features,
              'max_depth':max_depth,
              'min_samples_split':min_samples_split,
              'min_samples_leaf':min_samples_leaf,
              'bootstrap':bootstrap}

print(random_grid)

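from sklearn.ensemble import RandomForestRegressor

# Instantiate the estimator; the search needs an instance, not the bare class
rf = RandomForestRegressor()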

rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 10, cv = 3,
                              verbose = 2, random_state = 42, n_jobs = -1)

rf_random.fit(X_train,y_train)

rf_random.best_params_

In [None]:
prediction = rf_random.predict(X_test)

### Saving model as Pickle file

In [None]:
# Saving model as pickle file
import pickle

# The context manager closes the file even if dump() raises
with open('flight.pkl', 'wb') as file:
    pickle.dump(rf_random, file)

### Loading Model from pickle file

In [None]:
# Loading model from pickle file
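import pickle

# load() lives in the pickle module; the context manager closes the file afterwards
with open('flight.pkl', 'rb') as model_file:
    rf_model = pickle.load(model_file)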

### * Feature Scaling

Gradient based algorithms:
    Algorithms like Linear Regression, Logistic Regression, Neural Networks, etc. that use gradient descent as the optimization technique require data to be scaled. "Having features on the same scale helps gradient descent converge more quickly towards the minimum."

Distance based algorithms:
    Algorithms like SVM, KNN, K-means, etc. are the most affected by the range of features. This is because, behind the scenes, they calculate the distance between data points to determine their similarity, so a feature with a larger range dominates the distance.
    
Tree based algorithms are fairly insensitive to the scale of features. A node is split on one feature at a time, so the scale of the other features has no effect on the split and tree based algorithms are not affected by scaling.
    
Two types of Feature scaling techniques:
    1. Normalization
    2. Standardization

Normalization: (ranges between 0 and 1) Normalization is a good choice when your data does not follow a Gaussian distribution.
    Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling.
    
    X' = (X - Xmin) / (Xmax-Xmin)
    where, Xmin = minimum value of the feature
           Xmax = maximum value of the feature
           X = Present value in the feature
           X' = New scaled value in the feature
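
A minimal sketch of the min-max formula with NumPy. The array `x` is made-up illustrative data.

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 50.0])  # hypothetical feature values

# X' = (X - Xmin) / (Xmax - Xmin)
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled.tolist())  # [0.0, 0.25, 0.5, 1.0]
```

The smallest value maps to 0 and the largest to 1. scikit-learn's `MinMaxScaler` applies the same transformation column by column.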
           
Standardization: Standardization is a good choice when the data follows a Gaussian distribution.
    It is another scaling technique where the values are centered around the mean with a unit standard deviation. This means the mean of the attribute becomes 0 and the resultant standard deviation becomes 1.
    
    X' = (X - μ)/σ
    where, X' = New scaled value in the feature
           X = Present value in the feature
           μ = mean
           σ = standard deviation
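
A minimal sketch of the standardization formula with NumPy. The array `x` is made-up illustrative data, and `np.std` is used with its default population form (`ddof=0`).

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical feature values

mu = x.mean()      # 5.0
sigma = x.std()    # population standard deviation (ddof=0): 2.0

# X' = (X - mu) / sigma
z = (x - mu) / sigma
print(z.tolist())  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

After the transformation the values have mean 0 and standard deviation 1, but they are not bounded to a fixed range the way min-max scaled values are.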
           

    

### * Outliers

We generally define outliers as samples that are exceptionally far from the mainstream of the data.

Three ways to detect outliers:
    1. Standard Deviation method
    2. Inter Quartile range method
    3. Automatic outlier detection
    
Standard Deviation method:
    If we know that the distribution of values in a sample is Gaussian or Gaussian-like, we can use the standard deviation of the sample as a cut-off for identifying outliers. The share of values that fall within a given distance of the mean is:
    1. Within 1 standard deviation of the mean: 68%
    2. Within 2 standard deviations of the mean: 95%
    3. Within 3 standard deviations of the mean: 99.7%
A value that falls outside 3 standard deviations is still part of the distribution, but it is an unlikely or rare event.
Three standard deviations from the mean is the standard cut-off for Gaussian or Gaussian-like distributions. For small samples we can use 2 standard deviations; for large samples we can use up to 4 standard deviations.
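
The 68%, 95% and 99.7% figures follow from the cumulative probability of the normal distribution, which can be checked with the standard library alone. `share_within` is a helper name made up for this sketch.

```python
import math

def share_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(100 * share_within(k), 1))
# 1 68.3
# 2 95.4
# 3 99.7
```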


    import numpy as np
    
    mean, std = np.mean(data), np.std(data)
    cut_off = std * 3
    lower, upper = mean - cut_off, mean + cut_off
    
    # a value is an outlier if it falls below the lower limit or above the upper limit
    outliers = [x for x in data if x < lower or x > upper]
    outliers_removed = [x for x in data if lower <= x <= upper]


Interquartile Range (IQR):
    Not all data is normally distributed, or normal enough to be treated as drawn from a Gaussian distribution.
    A good statistic for summarizing non-Gaussian data is the interquartile range. The interquartile range is the difference between the 75th and 25th percentiles of the data and defines the box in a box-and-whisker plot. Percentiles can be calculated by sorting the observations and selecting values at specific indices. The 50th percentile is the middle value or, for an even number of observations, the average of the two middle values, e.g. the 5000th and 5001st sorted values in a dataset of 10000 rows.
    The IQR defines the middle 50% of the data, or the data body.
    
 
    import numpy as np
    q25, q75 = np.percentile(data,25), np.percentile(data,75)
    iqr = q75 - q25
    
    cut_off = iqr * 1.5
    lower = q25 - cut_off
    upper = q75 + cut_off
    
    # values more than 1.5 * IQR beyond the quartiles are treated as outliers
    outliers = [x for x in data if x < lower or x > upper]
    outliers_removed = [x for x in data if lower <= x <= upper]
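
A small worked example of the IQR method, assuming a made-up sample of ten values with one extreme entry. It also shows the median of an even-sized sample coming out as the average of the two middle values.

```python
import numpy as np

data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 50]  # hypothetical sample with one extreme value

q25, q50, q75 = np.percentile(data, [25, 50, 75])
iqr = q75 - q25
lower, upper = q25 - 1.5 * iqr, q75 + 1.5 * iqr

outliers = [x for x in data if x < lower or x > upper]
print(q25, q50, q75)   # 4.25 6.5 8.75
print(lower, upper)    # -2.5 15.5
print(outliers)        # [50]
```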

    
Automatic Outlier Detection:
    In machine learning, one approach to outlier detection is one-class classification. One-class classification, or OCC for short, involves fitting a model on the normal data and predicting whether new data is normal or an outlier/anomaly.
    The local outlier factor, or LOF for short, is a technique that harnesses the idea of nearest neighbors for outlier detection. Each example is assigned a score of how isolated it is, based on the density of its local neighborhood compared with the density around its neighbors.

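    from sklearn.neighbors import LocalOutlierFactor
    from sklearn.linear_model import LinearRegression
    
    # fit_predict labels inliers as 1 and outliers as -1
    lof = LocalOutlierFactor()
    yhat = lof.fit_predict(X_train)
    
    # select all rows that are not outliers (X_train and y_train are assumed to be NumPy arrays)
    mask = yhat != -1
    X_train,y_train = X_train[mask, :], y_train[mask]
    
    # fit the model on the filtered training data
    model = LinearRegression()
    model.fit(X_train,y_train)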
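
To make the LOF labels concrete, here is a minimal sketch on synthetic data: a small grid of points plus one point placed far away. The data and the choice of `n_neighbors=5` are assumptions for illustration only.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical data: a 5 x 5 grid of points plus one point far away from it
grid = np.array([[i, j] for i in range(5) for j in range(5)], dtype=float)
X = np.vstack([grid, [[100.0, 100.0]]])

lof = LocalOutlierFactor(n_neighbors=5)
labels = lof.fit_predict(X)   # 1 = inlier, -1 = outlier

print(labels[-1])                         # label of the far-away point
print(lof.negative_outlier_factor_[-1])   # more negative means more isolated
```

The far-away point is labelled -1, and its `negative_outlier_factor_` is far below the scores of the grid points.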
    


### * Gaussian Distribution or Normal Distribution test

Types of scenarios:
1. If the input feature is numerical, use a distribution plot (distplot), box plot, Q-Q plot, Kolmogorov-Smirnov test or Shapiro-Wilk test.
2. If the input feature is categorical, go directly to feature selection: use a correlation matrix with heatmap or the ANOVA test if the output feature is numerical, and the Chi-square test if the output feature is categorical.

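    # distplot is deprecated in recent seaborn releases; histplot with kde=True gives a comparable histogram with a density curve
    sns.histplot(df['feature'], kde=True)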

or

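Shapiro-Wilk test: if the p-value is greater than 0.05, we fail to reject normality and treat the data as normally distributed; if it is lower, we do not assume a Gaussian distribution.

    # shapiro is provided by scipy.stats, not statsmodels
    from scipy.stats import shapiro
    stat, p_value = shapiro(data)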
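
A short sketch of how the p-value rule plays out, assuming a synthetic, clearly skewed sample drawn from an exponential distribution. The names `skewed`, `alpha` and `looks_gaussian` are made up for this example.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)
skewed = rng.exponential(scale=1.0, size=500)  # hypothetical, clearly non-Gaussian sample

stat, p_value = shapiro(skewed)
alpha = 0.05
looks_gaussian = p_value > alpha
print(looks_gaussian)  # False
```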


### Feature Selection of Categorical Variables

1. input numerical and output numerical -- correlation matrix(pearson correlation coefficient), VIF
2. input categorical and output numerical -- ANOVA test
3. input numerical and output categorical -- ANOVA test
4. input categorical and output categorical -- Chi2 test
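
The ANOVA and Chi2 cases in the list can be sketched with `scipy.stats`. The groups and the contingency table are made-up numbers chosen only to show the calls. A small p-value suggests the input feature is related to the output and may be worth keeping.

```python
from scipy.stats import f_oneway, chi2_contingency

# Categorical input vs numerical output: one-way ANOVA across the categories
# (hypothetical prices grouped by a three-level category)
group_a = [10, 11, 12, 10, 11]
group_b = [20, 21, 19, 20, 22]
group_c = [30, 29, 31, 30, 32]
f_stat, p_anova = f_oneway(group_a, group_b, group_c)

# Categorical input vs categorical output: Chi-square test on a contingency table
# (hypothetical counts; rows = input category, columns = output class)
table = [[30, 10],
         [10, 30]]
chi2_stat, p_chi2, dof, expected = chi2_contingency(table)

print(p_anova < 0.05, p_chi2 < 0.05)  # True True
```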

### Grid Search CV for Linear Regression

In [None]:
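from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

model = LinearRegression()
# 'normalize' was removed from LinearRegression in scikit-learn 1.2, so fit_intercept is tuned here instead
parameters = {'fit_intercept':[True,False]}
lr_cv_model = GridSearchCV(model,parameters,refit=True,cv=5)
best_model = lr_cv_model.fit(X_train,y_train)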


### SVM and Tree based models

No normality assumptions are needed for tree based models and SVM.

Types of classifications:<br>
    1. Binary Classification - Logistic Regression, K-Nearest Neighbors, Decision Trees, Support Vector Machine, Naive Bayes<br>
    2. Multiclass Classification - K-Nearest Neighbors, Decision Trees, Naive Bayes, Random Forest, Gradient Boosting<br>
    3. Imbalanced Classification - Usually observed in binary classification datasets where normal and abnormal samples are not present in the same ratio.<br>
        When we encounter imbalanced binary classification, use random undersampling or SMOTE oversampling to balance the dataset, then use cost-sensitive Logistic Regression, cost-sensitive Decision Trees or cost-sensitive SVM, and evaluate with performance metrics like precision, recall and F-measure.
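
A minimal sketch of why precision, recall and F-measure are preferred over accuracy on imbalanced data. The labels are made up, and `precision_recall_f1` is a helper written for this example rather than a library function.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the chosen positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical imbalanced labels: 2 positives among 10 samples
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(accuracy, precision, recall, f1)  # 0.8 0.5 0.5 0.5
```

Accuracy looks high because the majority class dominates, while precision and recall show that only half of the minority-class predictions and cases are handled correctly.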

In [None]:
# Decision Tree Algorithm for classification
from sklearn.tree import DecisionTreeClassifier

# Instantiate the classifier before fitting
model = DecisionTreeClassifier()
model.fit(X_train,y_train)

In [None]:
# Random Forest Algorithm for classification
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_train,y_train)

In [None]:
# Logistic Regression for binary classification
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train,y_train)

In [None]:
# K Nearest Neighbors(KNN) for classification
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()
model.fit(X_train,y_train)

In [None]:
# Support Vector Machine(SVM) for classification
from sklearn.svm import SVC

model = SVC()
model.fit(X_train,y_train)

In [None]:
# Naive Bayes for classification
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(X_train,y_train)

In [None]:
# Gradient Boosting algorithm for classification
from sklearn.ensemble import GradientBoostingClassifier

# Instantiate the classifier before fitting
model = GradientBoostingClassifier()
model.fit(X_train,y_train)