# KAGGLE MACHINE LEARNING NOTEBOOK #
##### _Procedure guide for solving Kaggle competitions_

## 0. Preface
<a id='1'></a>

1. <b>Understand the problem</b>. We'll look at each variable and do a philosophical analysis about their meaning and importance for this problem.
2. <b>Univariate study</b>. We'll focus on the dependent variable ('SalePrice') and try to learn a little more about it.
3. <b>Multivariate study</b>. We'll try to understand how the dependent variable and independent variables relate.
4. <b>Basic cleaning</b>. We'll clean the dataset and handle the missing data, outliers and categorical variables.
5. <b>Test assumptions</b>. We'll check if our data meets the assumptions required by most multivariate techniques.

### Summary

[1]. <b>Exploratory data analysis </b>.

[1.1] General
- ProfileReport()
- info()
- value_counts()
- describe()
- hist()

[1.2] EDA

- Visualize output: 

      .plot()
- Correlation: 
        .corr()
        .scatter_matrix()
      
[1.3] Attribute Combinations

[2] <b> Data Cleaning </b>

[2.1] Numerical variables

[2.1.1] Missing data
- Analyzing
- dropna()
- drop()
- fillna()
- SimpleImputer()

[2.1.2] Outliers

[2.1.3] Contaminated data

[2.1.4] Invalid data

[2.1.5] Duplicate data

- drop_duplicates()


[2.2] Categorical Variables

[2.2.1] Encoder

- OrdinalEncoder()
- OneHotEncoder()

[2.2.2] Inconsistent data

[2.2.3] Datatype issues


[3] <b> Feature transform and scaling </b>

- MinMaxScaler()
- StandardScaler()


    


        


[To some Internal Section](#1)




In [None]:
import pandas as pd
from pandas_profiling import  ProfileReport
import numpy as np

# Load data on github
titanic_path = "https://raw.githubusercontent.com/thanhtam98/Kaggle-machine-learnin-note/main/data/titanic/"
titanic_df_train = pd.read_csv(titanic_path + "train.csv")
housing_path = "https://raw.githubusercontent.com/thanhtam98/Kaggle-machine-learnin-note/main/data/housing/"
housing_df_train = pd.read_csv(housing_path + "train.csv")


In [None]:
# Split numerical and categorical columns
titanic_df_train_num = titanic_df_train.select_dtypes(exclude=["object"])
titanic_df_train_cat = titanic_df_train.select_dtypes(exclude=[np.number])
# head() needs its parentheses; without them the bound method is printed, not the first rows
print("Titanic numeric \n" + str(titanic_df_train_num.head()))
print("Titanic categorical \n" + str(titanic_df_train_cat.head()))

housing_df_train_num = housing_df_train.select_dtypes(exclude=["object"])
housing_df_train_cat = housing_df_train.select_dtypes(exclude=[np.number])
print("Housing numeric  \n" + str(housing_df_train_num.head()))
print("Housing categorical  \n" + str(housing_df_train_cat.head()))


# make housing data as main data
df_train = housing_df_train
df_train_num = housing_df_train_num


## 1. Exploratory data analysis


_You need an overall picture of what the output value means in the real world. We are the ones who want to understand it._

### 1.1 General



#### pandas_profiling ProfileReport 

In [None]:
import pandas as pd
from pandas_profiling import ProfileReport

titanic_df_train.head()
# The report is built from the Titanic data, so the title names that dataset
profile = ProfileReport(
    titanic_df_train, title="Pandas Profiling Report for Titanic train dataset"
)
profile.to_file("pandas_profiling_example.html")

#### pandas support functions


In [None]:
import pandas as pd

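titanic_df_train.describe()      # summary statistics of the numerical columns
titanic_df_train.info()          # column dtypes and non-null counts
titanic_df_train.value_counts()  # frequency of each unique row; use df[col].value_counts() for one column
housing_df_train.describe()
housing_df_train.hist()          # one histogram per numerical column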


### 1.2 EDA


#### Visualize output

In [None]:
# Matplotlib
%matplotlib inline
from matplotlib import pyplot as plt
import pandas as pd

df_train.head()
df_train.hist(bins=50, figsize=(20, 15));

In [None]:
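# Seaborn
import seaborn as sns

# distplot is deprecated in recent seaborn releases; histplot with kde=True draws the same kind of plot
sns.histplot(df_train['SalePrice'], kde=True);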

#### Relationship and Correlation

##### Relationship with numerical variables

In [None]:
#scatter plot grlivarea/saleprice
var = 'GrLivArea'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));

##### Relationship with categorical features

In [None]:
import seaborn as sns

#box plot overallqual/saleprice
var = 'OverallQual'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);

var = 'YearBuilt'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 8))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);
plt.xticks(rotation=90);

##### Correlation matrix heat map

In [None]:
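#correlation matrix (numerical columns only; recent pandas versions raise an error on string columns)
corrmat = df_train.select_dtypes(include=[np.number]).corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True);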

#Todo: add detail; discard the variables with low correlation

In [None]:
# zoomed heatmap style

k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

##### Scatter plots




In [None]:
#scatterplot
sns.set()
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
sns.pairplot(df_train[cols], height=2.5)  # 'size' was renamed to 'height' in seaborn 0.9
plt.show();
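
### 1.3 Attribute Combinations

The summary lists this step but the notes leave it empty. A minimal sketch of the idea, assuming two of the housing columns plotted above can be added into one feature: the rows are made-up values and `TotalSF` is a hypothetical name, not something taken from the dataset.

```python
import pandas as pd

# Toy rows (made-up values) using column names from the housing data
toy = pd.DataFrame({
    "TotalBsmtSF": [856, 1262, 920],
    "GrLivArea": [1710, 1262, 1786],
    "SalePrice": [208500, 181500, 223500],
})

# Hypothetical combined attribute: basement area plus above-ground living area
toy["TotalSF"] = toy["TotalBsmtSF"] + toy["GrLivArea"]

# Compare how the original and the combined attributes correlate with the target
print(toy.corr()["SalePrice"].sort_values(ascending=False))
```

A combined attribute is worth keeping only if it correlates with the target at least as well as its parts on the real data.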


## 2. Data cleaning

### 2.1 Numerical variables

#### 2.1.1 Missing data:
* How prevalent is the missing data?
* Is missing data random or does it have a pattern?

##### Analyzing

In [None]:
# Todo: add more function to analyze data
total = df_train.isnull().sum().sort_values(ascending=False)
percent = (df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
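
The table above answers how prevalent the missing data is. Whether it is random or patterned can be probed by correlating the missingness indicators of each column. This is a rough sketch on a toy frame; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): the two garage columns are missing in the same rows
toy = pd.DataFrame({
    "garage_type": ["Attchd", None, "Detchd", None, "Attchd", "Attchd"],
    "garage_year": [2001.0, np.nan, 1975.0, np.nan, 1999.0, 2010.0],
    "lot_frontage": [65.0, 80.0, np.nan, 60.0, 70.0, np.nan],
})

# 1 where a value is missing, 0 otherwise
missing_flags = toy.isnull().astype(int)

# Correlation between the indicators: values near 1 mean two columns
# tend to be missing in the same rows, which suggests a pattern
missing_corr = missing_flags.corr()
print(missing_corr)
```

Columns whose indicators are strongly correlated usually share a cause (for example, a house without a garage), so they are better handled together than imputed one by one.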

##### Dealing with missing data (Missing data handling)

- **Drop**

In [None]:
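#dealing with missing data
# drop every column that has more than one missing value
df_train = df_train.drop(missing_data[missing_data['Total'] > 1].index, axis=1)
# drop the rows where 'Electrical' is missing
df_train = df_train.drop(df_train.loc[df_train['Electrical'].isnull()].index)
# check that no missing values remain
df_train.isnull().sum().max()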

- **Dropna**

In [None]:
df_train = df_train.dropna()
df_train.isnull().sum()

- **Fillna**

In [None]:
df_train = df_train.fillna(0)

- **Imputation**

In [None]:
from sklearn.impute import SimpleImputer

my_imputer = SimpleImputer(strategy="most_frequent") # most_frequent, mean, median, constant
# fit_transform returns a NumPy array, so restore the column names and the index
df_train_imputed = pd.DataFrame(
    my_imputer.fit_transform(df_train_num),
    columns=df_train_num.columns,
    index=df_train_num.index,
)


In [None]:
# OPTIONAL: COMPARE BEFORE AND AFTER IMPUTATION

# Define figure parameters
sns.set(rc={"figure.figsize": (14, 12)})
sns.set_style("whitegrid")
fig, axes = plt.subplots(2, 2)

# Plot the feature distributions before and after imputation, one row per feature
for fig_pos, feature in enumerate(["LotFrontage", "MasVnrArea"]):
    # before imputation
    p = sns.histplot(ax=axes[fig_pos, 0], x=df_train_num[feature],
                     kde=True, bins=30, color="dodgerblue", edgecolor="black")
    p.set_ylabel("Before imputation", fontsize=14)
    # after imputation
    q = sns.histplot(ax=axes[fig_pos, 1], x=df_train_imputed[feature],
                     kde=True, bins=30, color="firebrick", edgecolor="black")
    q.set_ylabel("After imputation", fontsize=14)


#### 2.1.2 Outliers


##### Univariate analysis

- Standardizing data

In [None]:

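#standardizing data
from sklearn.preprocessing import StandardScaler
# StandardScaler expects a 2-D array; convert the Series to NumPy before adding an axis
saleprice_scaled = StandardScaler().fit_transform(df_train['SalePrice'].to_numpy()[:, np.newaxis])
# the 10 lowest and the 10 highest standardized values
low_range = saleprice_scaled[saleprice_scaled[:,0].argsort()][:10]
high_range= saleprice_scaled[saleprice_scaled[:,0].argsort()][-10:]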
print('outer range (low) of the distribution:')
print(low_range)
print('\nouter range (high) of the distribution:')
print(high_range)

##### Bivariate analysis
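
This section is still empty in the notes. One possible sketch, assuming outliers are spotted on a scatter plot of `GrLivArea` against `SalePrice` and then removed with hand-picked cut-offs: the rows and both thresholds below are made up for illustration.

```python
import pandas as pd

# Toy rows (made-up values): the last two houses are very large but cheap
toy = pd.DataFrame({
    "GrLivArea": [1500, 1700, 2000, 2600, 4700, 5600],
    "SalePrice": [180000, 200000, 250000, 320000, 185000, 160000],
})

# Hypothetical cut-offs, chosen by eye from a scatter plot
area_cutoff = 4000
price_cutoff = 300000

# Flag the points that break the overall area/price trend
is_outlier = (toy["GrLivArea"] > area_cutoff) & (toy["SalePrice"] < price_cutoff)
print(toy[is_outlier])

# Keep everything else
toy_clean = toy[~is_outlier]
```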

#### 2.1.3 Contaminated data


#### 2.1.4 Invalid data


#### 2.1.5 Duplicate data
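
The summary lists `drop_duplicates()` for this step. A minimal sketch on a toy frame with made-up rows:

```python
import pandas as pd

# Toy frame (made-up values): the row with Id 2 appears twice
toy = pd.DataFrame({
    "Id": [1, 2, 2, 3],
    "SalePrice": [208500, 181500, 181500, 223500],
})

# Count fully identical rows (every occurrence after the first is flagged)
n_dupes = toy.duplicated().sum()
print(f"Duplicate rows: {n_dupes}")

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped)
```

Passing `subset=[...]` to `drop_duplicates()` restricts the comparison to selected columns, which helps when an identifier repeats but other fields differ.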


### 2.2 Categorical Variables

#### 2.2.1 Encoder

- **OrdinalEncoder**

- **OneHotEncoder**
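
A short sketch of both encoders on a toy frame. The column names and category values are made up, and the category order passed to `OrdinalEncoder` is an assumption about what the labels mean.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy frame (hypothetical columns)
toy = pd.DataFrame({
    "quality": ["Fa", "TA", "Gd", "TA"],         # ordered categories
    "street": ["Pave", "Grvl", "Pave", "Pave"],  # unordered categories
})

# OrdinalEncoder: pass the order explicitly, otherwise categories are sorted alphabetically
ordinal = OrdinalEncoder(categories=[["Fa", "TA", "Gd"]])
quality_codes = ordinal.fit_transform(toy[["quality"]])
print(quality_codes.ravel())

# OneHotEncoder: one 0/1 column per category; the result is sparse, so densify it to look at it
onehot = OneHotEncoder(handle_unknown="ignore")
street_onehot = onehot.fit_transform(toy[["street"]]).toarray()
print(onehot.categories_[0])
print(street_onehot)
```

Ordinal codes suit categories with a natural order. One-hot columns avoid implying an order where none exists.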

#### 2.2.2 Inconsistent data

#### 2.2.3 Datatype issues
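
This section is empty in the notes. A typical case, sketched here under the assumption that numeric columns were read as strings because of a stray text entry; the values, including the "unknown" entry, are made up.

```python
import pandas as pd

# Toy frame (made-up values): both columns were read as strings
toy = pd.DataFrame({
    "LotArea": ["8450", "9600", "unknown", "11250"],
    "YrSold": ["2008", "2007", "2008", "2006"],
})
print(toy.dtypes)

# errors="coerce" turns unparseable entries into NaN instead of raising an error
toy["LotArea"] = pd.to_numeric(toy["LotArea"], errors="coerce")

# astype(int) is enough when every entry is a clean integer string
toy["YrSold"] = toy["YrSold"].astype(int)
print(toy.dtypes)
```

The coerced NaN values then go through the missing-data handling in section 2.1.1.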

### 2.3 Quasi-Constant variables

In [None]:
from sklearn.feature_selection import VarianceThreshold
# threshold is a variance cut-off, not a share of rows: features whose variance is below 0.05 are dropped
sel = VarianceThreshold(threshold=0.05) 
# fit computes the variance of every feature
sel.fit(df_train_num.iloc[:, :-1])
# Get the number of features that are not constant
print(f"Number of retained features: {sum(sel.get_support())}")

print(f"\nNumber of quasi_constant features: {len(df_train_num.iloc[:, :-1].columns) - sum(sel.get_support())}")

quasi_constant_features_list = [x for x in df_train_num.iloc[:, :-1].columns if x not in df_train_num.iloc[:, :-1].columns[sel.get_support()]]

print(f"\nQuasi-constant features to be dropped: {quasi_constant_features_list}")
# Let's drop these columns from df_train_num
df_train_num.drop(quasi_constant_features_list, axis=1, inplace=True)

## 3. Feature transform and scaling 

- MinMaxScaler()

- StandardScaler()
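
A minimal sketch comparing the two scalers on a single made-up feature:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One toy feature with five made-up values (scikit-learn expects a 2-D array)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# MinMaxScaler: rescales to the [0, 1] range using the column min and max
minmax = MinMaxScaler().fit_transform(X)
print(minmax.ravel())

# StandardScaler: subtracts the mean and divides by the standard deviation
standard = StandardScaler().fit_transform(X)
print(standard.ravel())
```

In a real workflow the scaler should be fitted on the training set only and then applied to the test set with `transform()`, so that test statistics do not leak into training.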

## 4. Pipeline
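
The notes do not fill in this section yet. One way the earlier steps (imputation, scaling, encoding) could be chained is sketched below, assuming scikit-learn's `Pipeline` and `ColumnTransformer`. The toy rows are made up, and the choice of strategy per column type is only an illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame (made-up values) with one numerical and one categorical column
toy = pd.DataFrame({
    "GrLivArea": [1710.0, 1262.0, np.nan, 1717.0],
    "Street": ["Pave", "Grvl", "Pave", np.nan],
})
num_cols = ["GrLivArea"]
cat_cols = ["Street"]

# Numerical columns: fill gaps with the median, then standardize
num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode
cat_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# Apply each sub-pipeline to its own column group
preprocess = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    ("cat", cat_pipeline, cat_cols),
])

X_prepared = preprocess.fit_transform(toy)
print(X_prepared.shape)
```

Wrapping the steps this way means the same fitted transformations can be reused on the test set with `preprocess.transform(...)`.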

In [None]:
# !jupyter nbconvert --to html note.ipynb

In [None]:
# !jupyter nbconvert --to pdf note.ipynb