## Installing packages
* pip install lightgbm
* conda install -c conda-forge xgboost
* pip install xgboost
* pip install mlxtend

## **Stack Overflow: EDA & ML**  

**Elijah Zolduoarrati**  
**Approaches and Techniques:**

* EDA with Pandas and Seaborn
* Find features with strong correlation to the target variables (questions and answers)
* Data preprocessing: converting categorical features (mainly country) to numerical
* Apply the basic regression models of sklearn
* Use GridSearchCV to find the best parameters for each model
* Compare the performance of the regressors and choose the best one

**The notebook is organized as follows:**

* **[Part 0: Imports, Settings and switches, Global functions](#Part-0-:-Imports,-Settings,-Functions)**  
* import libraries  
* settings for number of cross validations  
* define functions that are used often

* **[Part 1: Exploratory Data Analysis](#Part-1:-Exploratory-Data-Analysis)**  
1.1 Get an overview of the features (numerical and categorical) and a first look at the target variables (questions and answers)  
[shape, info, head and describe](#shape,-info,-head-and-describe)  
[Distribution of the target variables](#The-target-variable-:-Distribution-of-questions-and-answers)  
[Numerical and Categorical features](#Numerical-and-Categorical-features)  
[List of features with missing values](#List-of-features-with-missing-values) and Filling missing values using [log transform](#log-transform)  
1.2 Relation of all features to target questions and answers  
[Seaborn regression plots for numerical features](#Plots-of-relation-to-target-for-all-numerical-features)  
[List of numerical features and their correlation coefficient to target](#List-of-numerical-features-and-their-correlation-coefficient-to-target)  
[Seaborn boxplots for categorical features](#Relation-to-questions-and-answers-for-all-categorical-features)  
[List of categorical features and their unique values](#List-of-categorical-features-and-their-unique-values)  
1.3 Determine the columns that show strong correlation to target  
[Correlation matrix 1](#Correlation-matrix-1): all numerical features; determine the features with the largest correlation to questions and answers

* **[Part 2: Data wrangling](#Part-2:-Data-wrangling)**  
[Dropping all columns with weak correlation to questions and answers](#Dropping-all-columns-with-weak-correlation-to-questions-and-answers)  
[Convert categorical columns to numerical](#Convert-categorical-columns-to-numerical)  
[Checking correlation to questions and answers for the new numerical columns](#Checking-correlation-to-questions-and-answers-for-the-new-numerical-columns)  
use only features with strong correlation to target  
[Correlation Matrix 2 (including converted categorical columns)](#Correlation-Matrix-2-:-All-features-with-strong-correlation-to-questions-and-answers)  
Create datasets for ML algorithms:  
[OneHotEncoder](#OneHotEncoder)  
[StandardScaler](#StandardScaler)

* **[Part 3: Scikit-learn basic regression models and comparison of results](#Part-3:-Scikit-learn-basic-regression-models-and-comparison-of-results)**  
Implement GridSearchCV with the RMSE metric for hyperparameter tuning of these sklearn models:  
[Linear Regression](#Linear-Regression)  
[Ridge](#Ridge)  
[Lasso](#Lasso)  
[Elastic Net](#Elastic-Net)  
[Stochastic Gradient Descent](#SGDRegressor)  
[DecisionTreeRegressor](#DecisionTreeRegressor)  
[Random Forest Regressor](#RandomForestRegressor)  
[KNN Regressor](#KNN-Regressor)  
Based on the RMSE metric, compare the performance of the regressors with their optimized parameters, then explore the correlation of the predictions and make a submission with the mean of the best models. Comparison plots:  
[RMSE of all models](#Comparison-plot:-RMSE-of-all-models)  
[Correlation of model results](#Correlation-of-model-results)  
Mean of best models


Note on scores:  
Submissions are evaluated on the Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed value of questions and answers. (Taking logs means that errors on users with many questions and answers and errors on users with few affect the result equally.)
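
As a rough illustration of the metric described in this note, the sketch below computes the RMSE between log-transformed predictions and observations. The function name `rmse_log` is hypothetical, and `np.log1p` (that is, log(1 + x)) is an assumption made here so that zero counts stay finite; the notebook itself may use a plain logarithm.

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """RMSE between log1p-transformed observations and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# The same relative error is penalised equally for small and large counts
small = rmse_log([9], [19])      # log(20) - log(10) = log(2)
large = rmse_log([999], [1999])  # log(2000) - log(1000) = log(2)
print(small, large)
```

Both calls return the same value, which is the sense in which the log makes errors on small and large counts comparable.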

# Part 0 : Imports, Settings, Functions

In [None]:
#Visualizing Lib
import seaborn as sns
import matplotlib.pyplot as plt

#Math Lib for some statistics
from scipy import stats
%matplotlib inline
sns.set()

# df preprocessing Lib
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 105)

# AI preprocessing lib
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import mean_squared_error
from sklearn.impute import SimpleImputer  # Imputer was removed in scikit-learn 0.22
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split 

# ML Lib
from sklearn.linear_model import ElasticNetCV, LassoCV, RidgeCV
import sklearn.linear_model as linear_model
from sklearn.svm import SVR
from lightgbm import LGBMRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from mlxtend.regressor import StackingCVRegressor

# warning suppressor
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)
#warnings.filterwarnings("ignore")
#importing necessary models and libraries

**Settings and switches**

**Here we can choose settings for optimal performance and runtime. For example, nr_cv sets the number of cross-validation folds used in GridSearchCV, and min_val_corr is the minimum value of the correlation coefficient to the target (only features with a larger correlation are used).** 

In [None]:
# setting the number of cross validations used in the Model part 
nr_cv = 5

# switch for using log values for the targets and features     
use_logvals = 1    
# target used for correlation 
target_1 = 'questions'
target_2 = 'answers'    
# only columns with correlation above this threshold value  
# are used for the ML Regressors in Part 3
min_val_corr = 0.4    
    
# switch for dropping columns that are similar to others already used and show a high correlation to these     
drop_similar = 1
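
To make the role of `min_val_corr` concrete, here is a minimal sketch of how such a threshold could be applied. The toy dataframe and its column names (`target`, `strong`, `noise`) are hypothetical, and comparing the absolute correlation against the threshold is an assumption about how the switch is meant to be used.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'strong' tracks the target, 'noise' does not
rng = np.random.default_rng(0)
target = rng.normal(size=200)
toy = pd.DataFrame({
    "target": target,
    "strong": target * 2 + rng.normal(scale=0.1, size=200),
    "noise": rng.normal(size=200),
})

min_val_corr = 0.4  # same threshold value as in the settings cell

# Absolute correlation of every feature with the target
corr_abs = toy.corr()["target"].abs().drop("target")

# Keep only the features whose correlation exceeds the threshold
selected = corr_abs[corr_abs > min_val_corr].index.tolist()
print(selected)
```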

**Initiate functions:**

In [None]:
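def get_best_score(grid):
    """Print and return the best RMSE of a fitted GridSearchCV object.

    Assumes the grid was fitted with scoring='neg_mean_squared_error',
    so that -grid.best_score_ is a mean squared error.
    """
    best_score = np.sqrt(-grid.best_score_)
    print(best_score)    
    print(grid.best_params_)
    print(grid.best_estimator_)
    return best_score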

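def print_cols_large_corr(df, nr_c, targ) :
    # restrict to numerical columns so that corr() also works when
    # categorical columns (e.g. country) are present
    corr = df.select_dtypes(include='number').corr()
    corr_abs = corr.abs()
    print (corr_abs.nlargest(nr_c, targ)[targ])

def plot_corr_matrix(df, nr_c, targ) :
    
    corr = df.select_dtypes(include='number').corr()
    corr_abs = corr.abs()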
    cols = corr_abs.nlargest(nr_c, targ)[targ].index
    cm = np.corrcoef(df[cols].values.T)

    plt.figure(figsize=(nr_c/1.5, nr_c/1.5))
    sns.set(font_scale=1.25)
    sns.heatmap(cm, linewidths=1.5, annot=True, square=True, 
                fmt='.2f', annot_kws={'size': 10}, 
                yticklabels=cols.values, xticklabels=cols.values
               )
    plt.show()
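
The following sketch shows how a helper like `get_best_score` might be fed, under the assumption that the grid search uses `scoring="neg_mean_squared_error"`. The synthetic data, the choice of `Ridge`, and the `alpha` grid are illustrative only and are not taken from the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, used only for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

grid = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.1, 1.0, 10.0]},
    cv=5,                              # plays the role of nr_cv
    scoring="neg_mean_squared_error",  # higher is better, so the MSE is negated
)
grid.fit(X, y)

# Same conversion as in get_best_score: negate, then take the square root
best_rmse = np.sqrt(-grid.best_score_)
print(best_rmse, grid.best_params_)
```

Because scikit-learn always maximises the score, the mean squared error is reported with a negative sign; negating it before the square root recovers an RMSE.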

In [None]:
from subprocess import check_output, call 
print(check_output(["dir", "C:\\Users\\alamo248\\Downloads\\Data\\all_uptt.csv"],shell=True).decode("utf8"))

In [None]:
#Importing data into dataframe
df =  pd.read_csv('C://Data/all_upt.csv')

In [None]:
#Displaying dataframe
df.head()

In [None]:
# substitute missing values with zero
#df = df.fillna(0)
#Displaying modified dataframe
#df.head()

In [None]:
# select the columns used in the analysis (by position)
df = df.iloc[:,[0,6,7,8,9,10,11,12,13,14,15,16,17]]
# Or
# vec = df.loc[:,['Id','DisplayName','Location','country','AboutMe_length','activity_in_months','UpVotes','DownVotes','Reputation','Views','badges','Q_comments','A_comments','P_questions','P_answers']]

In [None]:
df.head()

In [None]:
# Create training and testing sets
df_train,df_test= train_test_split(df, test_size = 0.2, random_state = 0)

# Part 1: Exploratory Data Analysis

## 1.1 Overview of features and relation to target

Let's get a first overview of the train and test datasets:
* How many rows and columns are there?  
* What are the names of the features (columns)?  
* Which features are numerical, which are categorical?  
* How many values are missing?  

The **shape** attribute and the **info** method answer these questions, **head** displays the first rows of the dataset, and **describe** gives summary statistics (for numerical columns only).
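
As a small sketch of how these questions can also be answered programmatically, the example below splits columns by type and counts missing values. The toy dataframe is hypothetical; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one categorical and two numerical columns
toy = pd.DataFrame({
    "country": ["NZ", "DE", None, "US"],
    "Reputation": [10.0, np.nan, 3.0, 7.0],
    "Views": [1, 2, 3, 4],
})

# Split the columns by dtype
numerical = toy.select_dtypes(include="number").columns.tolist()
categorical = toy.select_dtypes(exclude="number").columns.tolist()

# Number of missing values per column
missing = toy.isna().sum()

print(toy.shape)
print(numerical, categorical)
print(missing[missing > 0])
```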

### shape, info, head and describe

In [None]:
print('-'*100)
print('training sample size')
print(df_train.shape)
print('-'*100)
print('testing sample size')
print(df_test.shape)
print('-'*100)
print('training sample features description')
df_train.info()
print('-'*100)
print('testing sample features description')
df_test.info()
print('-'*100)

* The training and testing dataframes each consist of 13 columns (12 features excluding Id). The training dataframe has 112147 entries (rows); the testing dataframe has 28037 entries.  
* Many features are probably related to the dependent (target) variables questions and answers, such as badges, reputation, etc.   
* Other features may be less important for predicting the target, while some features (like activity_in_months) might show a strong correlation.
* Some columns have missing values, and some countries seem to have more missing data than others; we will deal with missing data at a later stage.
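
One way to check the observation about missing data per country is to compute the share of missing values within each group. This is only a sketch on hypothetical toy data; the column names mirror the dataset's, but the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names mirror the dataset's
toy = pd.DataFrame({
    "country": ["NZ", "NZ", "DE", "DE"],
    "Reputation": [1.0, np.nan, 5.0, 6.0],
    "badges": [np.nan, np.nan, 2.0, 3.0],
})

# Fraction of missing values per feature, within each country
missing_share = (
    toy.drop(columns="country")
       .isna()
       .groupby(toy["country"])
       .mean()
)
print(missing_share)
```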

In [None]:
# displaying a sample from the training dataframe
df_train.head()

In [None]:
# displaying a descriptive stats regarding the training dataframe 
df_train.describe()

In [None]:
# displaying a sample from the testing dataframe
df_test.head()

In [None]:
# displaying a descriptive stats regarding the testing dataframe
df_test.describe()

## Distribution of target variables (Questions and Answers)

In [None]:
## optimising plots size
# data = np.random.normal(0, 1, 3)
# array([-1.18878589,  0.59627021,  1.59895721])
# ploty = plt.figure(figsize=(20, 15))
# sns.boxplot(x=data);

In [None]:
# NaN values caused "ValueError: cannot convert float NaN to integer",
# so they are dropped before plotting (alternative: fill them earlier)
# df_train = df_train.fillna(0)
print('-'*100)
print('-'*100)
plt.figure(figsize=(20, 15))
sns.distplot(df_train['P_questions'].dropna())
# skewness and kurtosis
print("Skewness: %f" % df_train['P_questions'].skew())
print("Kurtosis: %f" % df_train['P_questions'].kurt())
print('-'*100)
print('-'*100)

In [None]:
print('-'*100)
print('-'*100)
plt.figure(figsize=(20, 15))
sns.distplot(df_train['P_answers'].dropna())
# skewness and kurtosis
print("Skewness: %f" % df_train['P_answers'].skew())
print("Kurtosis: %f" % df_train['P_answers'].kurt())
print('-'*100)
print('-'*100)

* As we can see, neither target variable (questions or answers) is normally distributed. 
* This can reduce the performance of the ML regression models, because some of them assume normally distributed data.
* Therefore, a log transformation is applied (see the sklearn documentation on preprocessing) to bring the distribution closer to normal and make it easier to visualise.
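
To illustrate the effect of such a transformation, the sketch below compares the skewness of a right-skewed series before and after taking logs. The data is synthetic (a rounded log-normal sample standing in for a column like `P_questions`), and `np.log1p` is an assumption chosen so that zero counts remain finite.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed counts standing in for a column like P_questions
rng = np.random.default_rng(0)
counts = pd.Series(rng.lognormal(mean=1.0, sigma=1.0, size=5000).round())

skew_raw = counts.skew()
skew_log = np.log1p(counts).skew()  # log(1 + x) stays finite at x = 0

print("Skewness before: %f" % skew_raw)
print("Skewness after:  %f" % skew_log)
```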

In [None]:
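np.seterr(divide = 'ignore')
# work on a copy so the new column is not written to a slice of df
df_train = df_train.copy()
# use the column as a 1-D Series; NaN and non-positive values map to 0
questions = df_train['P_questions']
df_train['Questions_log'] = np.where(questions > 0, np.log(questions), 0)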

# # alternative implementation -- a bit more typing but avoids warnings.
# loc = np.where(myarray>0)
# result2 = np.zeros_like(myarray, dtype=float)
# result2[loc] =np.log(myarray[loc])

# # answer from Enrico...
# myarray= np.random.randint(10,size=10)
# result = np.where(myarray>0, np.log(myarray), 0)

# # check it is giving right solution:
# print(np.allclose(result, result2))
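
A third variant, sketched here as an alternative to the two approaches in the cell above, uses the `where` and `out` arguments of `np.log` so that the logarithm is never evaluated at zero and no warning needs to be suppressed. The array `counts` is a made-up example.

```python
import numpy as np

counts = np.array([0, 1, 10, 100], dtype=float)

# Compute the log only where counts > 0; all other positions keep
# the initial value 0.0 from the pre-filled output array
safe_log = np.log(counts, out=np.zeros_like(counts), where=counts > 0)
print(safe_log)
```

Pre-filling `out` matters: positions excluded by `where` are left untouched, so without an initialised output array their contents would be undefined.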

In [None]:
# 'Questions_log' was already added to df_train as a column in the previous
# cell, so no pd.concat is needed (Questions_log is not a standalone variable)

In [None]:
df_train.head()

In [None]:
# # creating a log feature for dependent variables
# df_train['Questions_log'] = np.log(df_train['P_questions'])
# df_train['Answers_log'] = np.log(df_train['P_answers'])

In [None]:
# df_train['Questions_log']

In [None]:
# plt.figure(figsize=(20,15))
# sns.distplot(df_train['Questions_log'].dropna())
# #sns.distplot(df_train['P_answers'].dropna());

In [None]:
# # create the independent variable vector
# x = df.iloc[:,6:15].values
# # Question dependent variable vector 
# y = df.iloc[:, -2:-1].values
# # Answer dependent variable vector 
# z = df.iloc[:,-1:].values

In [None]:
# # create labelEncoder object to transform categorical values into integers
# x[:, 0] = LabelEncoder().fit_transform(x[:, 0])
# y = LabelEncoder().fit_transform(y)
# z = LabelEncoder().fit_transform(z)

In [None]:
# # creating OneHotEncoder object to transform integer categorical values into dummy categorical
# x = OneHotEncoder(categorical_features=[0]).fit_transform(x).toarray()

In [None]:
# # Create training and testing sets
# x_train,x_test,y_train,y_test,z_train,z_test = train_test_split(x,y,z, test_size = 0.2, random_state = 0)

In [None]:
# # feature scaling
# sc_x = StandardScaler()
# x_train = sc_x.fit_transform(x_train)
# x_test = sc_x.transform(x_test)