# Hello, fellow Kagglers!
## As the title says, this tutorial is for beginners. Each part of this kernel covers a step in getting started with ML competitions and real-world data problems. I hope you enjoy it.
### Let's start!

In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
print(os.listdir("../input"))


Loading the data, usually done with the **pandas** library.

In [None]:
housing = pd.read_csv('../input/housing.csv')


In [None]:
housing.head()

In [None]:
housing.info()

> The count of each value in the `ocean_proximity` column.

In [None]:
housing.ocean_proximity.value_counts()

In [None]:
housing.describe()

In [None]:
sns.set()
housing.isna().sum().sort_values(ascending=True).plot(kind='barh',figsize=(10,7))  # Quick peek at the number of missing values per column
# We'll deal with these later, in the cleaning part, using various methods.

In [None]:
housing.hist(bins=50,figsize=(20,15))  # bins sets the number of bins in each histogram
plt.show()

In [None]:
from sklearn.model_selection import train_test_split
train_, test_ = train_test_split(housing,test_size=0.2,random_state=1)
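
> A minimal sketch of what `train_test_split` does, using a made-up 10-row frame. The names `toy`, `toy_train` and `toy_test` are hypothetical stand-ins, not the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the housing data (10 rows).
toy = pd.DataFrame({"feature": np.arange(10), "target": np.arange(10) * 2})

# Same arguments as in the cell above: 20% of the rows go to the test set.
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=1)

print(len(toy_train), len(toy_test))               # 8 2
# The two index sets never overlap, and together they cover every row.
print(set(toy_train.index) & set(toy_test.index))  # set()
```

> Fixing `random_state` makes the split reproducible, so the same rows land in the test set on every run.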

> # EDA: time to have a look at our data
> A good practice is to run EDA on a copy of the full data, so that the exploration does not alter our training and test sets.

In [None]:
plotter = housing.copy()

Since there is geographical information (latitude and longitude), it is a good idea to create a scatterplot of all districts to visualize the data.

In [None]:
sns.set()
plt.figure(figsize=(10,8))#Figure size
plt.scatter('longitude','latitude',data=plotter)
plt.ylabel('Latitudes')
plt.xlabel('Longitudes')
plt.title('Geographical plot of Lats/Lons')
plt.show()

> #### The plot above looks like California, right? ![img](https://california.azureedge.net/cdt/CAgovPortal/images/Uploads/menu-living.jpg)

> But the plot is not very **informative** yet, since we can't see how dense the points are. Let's make a simple modification.


In [None]:
sns.set()
plt.figure(figsize=(10,8))#Figure size
plt.scatter('longitude','latitude',data=plotter,alpha=0.1)
plt.ylabel('Latitudes')
plt.xlabel('Longitudes')
plt.title('Geographical plot of Lats/Lons')
plt.show()

> Now it's much better. If you're familiar with California's map, you can clearly see the high-density areas, namely the Bay Area and the areas around Los Angeles and San Diego.
More generally, our brains are good at spotting patterns visually, but we often need to play around with the visualizations to make the patterns stand out.

In [None]:
# DataFrame.plot creates its own figure (see figsize below), so no separate plt.figure() call is needed.
plotter.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
        s=plotter["population"]/100, label="population", figsize=(15,8),
        c="median_house_value", cmap=plt.get_cmap("jet"),colorbar=True,
    )
plt.legend()

> Now we can say that the house price is related to the location (e.g., close to the ocean) and to the population density.

In [None]:
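corr_matrix=plotter.select_dtypes(include='number').corr()  # numeric columns only; recent pandas raises on the text column ocean_proximity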
corr_matrix.median_house_value.sort_values(ascending=False)
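
> Keep in mind that the coefficient returned by `corr()` (Pearson's, by default) only measures *linear* relationships. A minimal sketch on made-up data (the frame `demo` and its columns are hypothetical):

```python
import numpy as np
import pandas as pd

x = np.linspace(-1, 1, 201)
demo = pd.DataFrame({
    "x": x,
    "linear": 3 * x + 1,   # perfectly linear in x
    "squared": x ** 2,     # fully determined by x, but not linearly
})

# Correlation of every column with x.
corr_demo = demo.corr()["x"]
print(corr_demo)
# "linear" correlates at 1 with x, while "squared" comes out at
# (numerically) zero even though it depends entirely on x.
```

> So a coefficient near zero does not prove that two attributes are unrelated, which is one more reason to look at plots as well.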

> Checking the correlation between the main features with the pandas function `scatter_matrix`, which plots every selected feature against every other one so that linear relationships are easy to spot.

In [None]:
from pandas.plotting import scatter_matrix  # scatter_matrix lives in pandas.plotting
sns.set()
feat = ['median_house_value','median_income','total_rooms','housing_median_age']
scatter_matrix(plotter[feat],figsize=(15,8))

In [None]:
plt.figure(figsize=(12,7))
plt.scatter('median_income','median_house_value',data=plotter,alpha=0.1)
plt.xlabel('Median income')
plt.ylabel('Median house value')
plt.title('Linear correlation Median income/Median House value')

> **NB:** 
One last thing you may want to do before actually preparing the data for Machine Learning algorithms is to try out various attribute combinations. For example, the total number of rooms in a district is not very useful if you don’t know how many households there are. What you really want is the number of rooms per household.

In [None]:
plotter['rooms_per_household'] = plotter.total_rooms / plotter.households  # use the same frame on both sides

In [None]:
plotter.head()

In [None]:
corr_matrix1=plotter.select_dtypes(include='number').corr()  # numeric columns only
corr=corr_matrix1.median_house_value.sort_values(ascending=False)
d= pd.DataFrame({'Column':corr.index,
                 'Correlation with median_house_value':corr.values})
d

> Not bad, haha! The number of rooms per household is now more informative than the total number of rooms in a district.

# Data cleaning

> Most Machine Learning algorithms cannot work with **missing features**, so let’s create a few functions to take care of them. You noticed earlier that the total_bedrooms attribute has some missing values, so let’s fix this. You have three options:
1. Get rid of the corresponding districts.
2. Get rid of the whole attribute.
3. Set the values to some value (zero, the mean, the median, etc.)

Since we don't have a lot of data, the first option isn't ideal, and neither is the second because we need that feature. The wisest choice is to fill the gaps with the median: the mean is pulled toward outliers, which would distort the values our model trains on.
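
> A minimal sketch of why the median is the safer fill value when outliers are present. The numbers below are made up for illustration and are not taken from the housing data.

```python
import numpy as np
import pandas as pd

# Hypothetical bedroom counts: one extreme outlier and one missing value.
bedrooms = pd.Series([2.0, 3.0, 3.0, 4.0, 500.0, np.nan])

# pandas skips NaN by default when computing these statistics.
print(bedrooms.mean())    # 102.4 -> dragged up by the single outlier
print(bedrooms.median())  # 3.0   -> still a typical value

# Option 3 with the median: the missing entry becomes 3.0.
filled = bedrooms.fillna(bedrooms.median())
print(filled.tolist())    # [2.0, 3.0, 3.0, 4.0, 500.0, 3.0]
```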

> I'm leaving these options commented out just to show how to do them; I won't use them in this tutorial.

In [None]:
#plotter.dropna(subset=["total_bedrooms"]) # option 1: drop the districts with missing values
#plotter.drop("total_bedrooms", axis=1) # option 2: drop the whole attribute
#median = plotter["total_bedrooms"].median() # option 3: fill with the median
#plotter["total_bedrooms"] = plotter["total_bedrooms"].fillna(median)


> Scikit-learn has a handy class, `SimpleImputer`, that fills missing values using a strategy such as the median or the mean.
 We'll use that!

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')  # the median is more robust to outliers than the mean

> If we run `imputer.fit(...)` on the full DataFrame we'll get an error, since the median can only be computed on numerical attributes. As shown at the very beginning, we have a categorical attribute, **`ocean_proximity`**, so we need to drop it first.

In [None]:
ft_data = plotter.drop('ocean_proximity',axis=1)

In [None]:
imputer.fit(ft_data)


In [None]:
imputer.statistics_ #Here's the median of every attribute in our data !

In [None]:
ft_data.total_bedrooms.median()

> Now you can use this “trained” imputer to transform the data, replacing missing values with the learned medians:

In [None]:
X = imputer.transform(ft_data)

> The result is a plain NumPy array containing the transformed features. Putting it back into a pandas DataFrame is simple:


In [None]:
ft_transformed = pd.DataFrame(X,columns=ft_data.columns)
ft_transformed.tail() #The missing values in total_bedrooms were replaced by the median value
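
> The cells above fit the imputer on the full exploration copy. When preparing data for a model, a common practice is to learn the medians from the training rows only and reuse them on the test rows. A minimal sketch on made-up frames (`train_toy`, `test_toy` and `toy_imputer` are hypothetical names):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical train/test frames, each with a missing value.
train_toy = pd.DataFrame({"total_bedrooms": [1.0, 2.0, np.nan, 4.0, 9.0]})
test_toy = pd.DataFrame({"total_bedrooms": [np.nan, 3.0]})

toy_imputer = SimpleImputer(strategy="median")
toy_imputer.fit(train_toy)        # learns the median of the training rows
print(toy_imputer.statistics_)    # [3.]

# transform() returns a NumPy array, so rebuild the DataFrames afterwards.
train_filled = pd.DataFrame(toy_imputer.transform(train_toy),
                            columns=train_toy.columns, index=train_toy.index)
test_filled = pd.DataFrame(toy_imputer.transform(test_toy),
                           columns=test_toy.columns, index=test_toy.index)
print(test_filled["total_bedrooms"].tolist())  # [3.0, 3.0]
```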

> Let's handle our categorical data issue

In [None]:
obj_cols = housing.dtypes
obj_cols[obj_cols=='object']

In [None]:
sns.set(palette='Set2')
housing.ocean_proximity.value_counts().sort_values(ascending=True).plot(kind='barh',figsize=(10,7))
plt.legend()

> In this case I will one-hot encode the labels. Scikit-learn offers several encoders for categorical data, such as label encoding and ordinal encoding.

In [None]:
from sklearn.preprocessing import OneHotEncoder
lab_encoder = OneHotEncoder()
cat_house = housing[['ocean_proximity']]
cat_enc = lab_encoder.fit_transform(cat_house)
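
> A minimal sketch of what the encoder produces, on a made-up column (`toy_cat` and `toy_encoder` are hypothetical names). By default `OneHotEncoder` returns a SciPy sparse matrix, so `toarray()` is used here to inspect it.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column with three distinct values.
toy_cat = pd.DataFrame(
    {"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]}
)

toy_encoder = OneHotEncoder()
toy_encoded = toy_encoder.fit_transform(toy_cat)  # sparse matrix by default

# Categories are stored sorted: INLAND, ISLAND, NEAR BAY.
print(toy_encoder.categories_)

# One column per category, with a single 1 in each row.
print(toy_encoded.toarray())
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]]
```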

> One of the most important transformations to apply to your data is **feature scaling**.
> With few exceptions, machine learning algorithms don't perform well when the input attributes have very different scales, so we want to bring them onto the same scale. **Note that the target attribute generally doesn't need to be scaled.**

> We have two common ways to get all the attributes onto the same scale:
1. Min-max scaling.
    Many people call it normalization, and it's quite simple: values are shifted and rescaled so that they range from 0 to 1.

![iz](https://i.imgur.com/FH9LCE6.png)

2. Standardization is a bit different: first it subtracts the mean value (so standardized values always have a zero mean), then it divides by the standard deviation so that the resulting distribution has unit variance. This is how we calculate the standard deviation:

![img](https://i.imgur.com/EFlEx48.png)

> N is the number of samples. We sum the squared differences from the mean, (X(i) - X̅)², divide by N, and take the square root to get the standard deviation. Don't worry, there are plenty of tools that compute all this for us, but it's always good to know what you're computing.
To compute the standard deviation (STD) we can use NumPy. For example, the STD of median_income is simply `np.std(housing['median_income'])`.
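
> A minimal sketch of both scaling methods on made-up values, computed by hand with NumPy and then checked against scikit-learn's `MinMaxScaler` and `StandardScaler`. The array `values` is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature as a column vector (scikit-learn expects 2-D input).
values = np.array([[1.0], [2.0], [3.0], [4.0], [10.0]])

# Min-max scaling: shift by the minimum, divide by the range -> values in [0, 1].
manual_minmax = (values - values.min()) / (values.max() - values.min())

# Standardization: subtract the mean, divide by the standard deviation.
# np.std divides by N by default, which is also what StandardScaler uses.
manual_standard = (values - values.mean()) / values.std()

sk_minmax = MinMaxScaler().fit_transform(values)
sk_standard = StandardScaler().fit_transform(values)

print(np.allclose(manual_minmax, sk_minmax))      # True
print(np.allclose(manual_standard, sk_standard))  # True
```

> Note how the outlier (10.0) squeezes the other min-max values toward 0, while standardization is less affected by it.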

> ### As we've seen, the data needs a lot of transformations. Fortunately, scikit-learn provides a **Pipeline** class to help chain them: [Pipeline docs](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)
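
> A minimal sketch of how the steps from this kernel (median imputation, scaling, one-hot encoding) could be chained. It is one possible arrangement, not the only one: the toy frame, the column lists, and the use of `ColumnTransformer` to route numeric and categorical columns are assumptions added for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with one missing value and one categorical column.
toy = pd.DataFrame({
    "total_bedrooms": [1.0, np.nan, 3.0, 5.0],
    "median_income": [2.0, 4.0, 6.0, 8.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR BAY"],
})
num_cols = ["total_bedrooms", "median_income"]
cat_cols = ["ocean_proximity"]

# Numeric columns: fill missing values with the median, then standardize.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Send each group of columns through its own transformation.
preprocess = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    ("cat", OneHotEncoder(), cat_cols),
])

prepared = preprocess.fit_transform(toy)
print(prepared.shape)  # (4, 4): 2 scaled numeric columns + 2 one-hot columns
```

> On real data you would call `fit_transform` on the training set and only `transform` on the test set, so the test rows never influence the learned medians and scales.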

> # Now you're ready to train your model on the training set and evaluate its performance on the test set that we created with the train_test_split function!
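
> A minimal sketch of that last step, assuming a plain linear regression and synthetic data. The model choice, the RMSE metric and all names below are illustrative assumptions, not results from the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: y is roughly 3 * x + 5, plus some noise.
rng = np.random.default_rng(1)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = 3.0 * X_toy[:, 0] + 5.0 + rng.normal(0, 1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1
)

# Fit on the training set only, then evaluate on the held-out test set.
model = LinearRegression().fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"Test RMSE: {rmse:.2f}")
```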

## Thank you for reading!
If you found this helpful, an upvote would be very much appreciated.