# Welcome!

> Whatever you study, you will never truly learn it until you see how it works in practice. It is very important to work through a real-life example of whatever you are studying.

* In this notebook, I will try to walk you through a real machine learning project.

Steps in an ML project:
1. Get the data.
2. Discover and visualize the data to gain insights.
3. Prepare the data for Machine Learning algorithms.
4. Select a model and train it.
5. Fine-tune your model.
6. Present your solution.

In this notebook we use California housing data to predict housing prices.
Since the target (the median house value) is a continuous number, this is a regression task.

# Get the Data

In [None]:
# Download the data
import os
import tarfile
import urllib.request

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # Create the target directory if needed, download the archive, and extract it.
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

In [None]:
fetch_housing_data()

In [None]:
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

In [None]:
# Take a quick look at the data and its stats.
housing = load_housing_data()
housing.head()

In [None]:
# Get a quick description of the data.
housing.info()

1. The total_bedrooms attribute has missing values.
2. ocean_proximity is a categorical attribute.

In [None]:
# Categories in ocean_proximity and the number of districts in each
housing['ocean_proximity'].value_counts()

In [None]:
# summary of numerical attributes.
housing.describe()

Let's take a quick look at the data distribution.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
plt.show()

> We notice a few things in these histograms:
* First, the median income attribute does not look like it is expressed in US dollars (USD). The data has been scaled: the numbers represent roughly tens of thousands of dollars (e.g., 3 actually means about 30,000).
* The housing median age and the median house value were capped. Your Machine Learning algorithms may learn that prices never go beyond that limit.
* Finally, many histograms are tail heavy: they extend much farther to the right of the median than to the left. This may make it a bit harder for some Machine Learning algorithms to detect patterns. We will try transforming these attributes later on to have more bell-shaped distributions.
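
To illustrate the last point, here is a minimal sketch of how a log transform can make a tail-heavy distribution more bell-shaped. It uses synthetic log-normal data (an assumption for illustration, not the housing data), and the `skewness` helper is a hypothetical name introduced here.

```python
import numpy as np

def skewness(values):
    """Sample skewness: mean of the cubed standardized values."""
    values = np.asarray(values, dtype=float)
    standardized = (values - values.mean()) / values.std()
    return float((standardized ** 3).mean())

# Synthetic tail-heavy data (log-normal), standing in for a skewed attribute.
rng = np.random.default_rng(0)
tail_heavy = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

# Taking the logarithm pulls in the long right tail.
transformed = np.log(tail_heavy)

skew_before = skewness(tail_heavy)
skew_after = skewness(transformed)
print(f"skewness before: {skew_before:.2f}, after log: {skew_after:.2f}")
```

A skewness close to 0 indicates a roughly symmetric distribution, so the value after the transform should be much closer to 0 than the value before it.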

In [None]:
# Creation of training and test set.
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

In [None]:
test_set.head()

In [None]:
housing['median_income'].hist()
plt.show()

Let's look at the median income histogram more closely: most median income values are clustered around 1.5 to 6 (i.e., 15,000–60,000), but some median incomes go far beyond 6. It is important to have a sufficient number of instances in your dataset for each stratum, or else the estimate of the stratum's importance may be biased. This means that you should not have too many strata, and each stratum should be large enough. The following code uses the pd.cut() function to create an income category attribute with 5 categories (labeled from 1 to 5): category 1 ranges from 0 to 1.5 (i.e., less than 15,000), category 2 from 1.5 to 3, and so on.
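
Before applying this to the full dataset, a small sketch on a handful of made-up income values (hypothetical, not taken from the data) may help show how pd.cut() assigns categories. Note that by default each bin includes its right edge, so a value of exactly 1.5 lands in category 1.

```python
import numpy as np
import pandas as pd

# Hypothetical median incomes (in tens of thousands of dollars).
sample_income = pd.Series([0.5, 1.5, 2.0, 4.5, 7.0])

# Same bin edges and labels as described above.
sample_cat = pd.cut(sample_income,
                    bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                    labels=[1, 2, 3, 4, 5])

# Show each income next to the category it was assigned.
result = pd.DataFrame({"median_income": sample_income, "income_cat": sample_cat})
print(result)
```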

In [None]:
import numpy as np
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])

In [None]:
housing["income_cat"].value_counts()

In [None]:
housing['income_cat'].hist()
plt.show()

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit
split= StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

* See https://www.kaggle.com/dhirajnirne/stratified-sampling to learn about the importance and use of stratified sampling.
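
As a rough illustration of why stratification helps, the sketch below compares a purely random split with a stratified one on made-up category counts (an assumption for illustration, not the real income categories). It uses the `stratify` argument of `train_test_split` as a shortcut instead of `StratifiedShuffleSplit`; `proportions` is a hypothetical helper.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical income categories for 1,000 districts (made-up counts).
income_cat = np.repeat([1, 2, 3, 4, 5], [40, 320, 350, 180, 110])
districts = np.arange(len(income_cat))  # stand-in for the feature rows

def proportions(cats):
    """Share of each category 1..5 in `cats`."""
    return np.array([(cats == c).mean() for c in range(1, 6)])

# Purely random split: category shares in the test set can drift.
_, _, _, rand_test_cat = train_test_split(
    districts, income_cat, test_size=0.2, random_state=42)

# Stratified split: category shares are preserved as closely as possible.
_, _, _, strat_test_cat = train_test_split(
    districts, income_cat, test_size=0.2, random_state=42, stratify=income_cat)

overall = proportions(income_cat)
rand_error = np.abs(proportions(rand_test_cat) - overall).max()
strat_error = np.abs(proportions(strat_test_cat) - overall).max()
print(f"largest proportion error, random split:     {rand_error:.4f}")
print(f"largest proportion error, stratified split: {strat_error:.4f}")
```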

In [None]:
# Let's check whether it worked: income category proportions in the test set
strat_test_set['income_cat'].value_counts()/ len(strat_test_set)

In [None]:
# Now you should remove the income_cat attribute so the data is back to its original state.
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

# Discover and visualize the data to gain insights

In [None]:
# Let's create a copy of the training set so we can play with it safely
housing = strat_train_set.copy()

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude")
plt.show()

In [None]:
# It's hard to see any pattern here, so let's reduce alpha
housing.plot(kind='scatter', x='longitude', y='latitude', alpha=0.1)
plt.show()

In [None]:
# let's make it clearer
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
             s=housing["population"]/100, label="population", figsize=(10,7),
             c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
             sharex=False)
plt.legend()
plt.show()
# The radius of each circle represents the district's population (option s), and the color represents the price (option c).
# We use a predefined color map (option cmap) called jet, which ranges from blue (low prices) to red (high prices).

This image tells you that the housing prices are very much related to the location (e.g., close to the ocean) and to the population density.


In [None]:
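# Let's look for correlations (only numerical columns can be correlated, so drop the text attribute first)
corr_matrix = housing.drop("ocean_proximity", axis=1).corr()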

In [None]:
# Let's see how each attribute correlates with median_house_value
corr_matrix['median_house_value'].sort_values(ascending=False)

The correlation coefficient ranges from –1 to 1. When it is close to 1, it means that
there is a strong positive correlation; for example, the median house value tends to go
up when the median income goes up. When the coefficient is close to –1, it means
that there is a strong negative correlation; you can see a small negative correlation
between the latitude and the median house value (i.e., prices have a slight tendency to
go down when you go north).
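
To make the coefficient concrete, here is a minimal sketch that computes the Pearson correlation by hand and compares it with NumPy's result. The numbers are made up for illustration, and `pearson_r` is a hypothetical helper name.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Made-up values: one target rises with income, the other falls.
income = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
rising = 2.0 * income + 1.0     # perfectly linear and increasing
falling = -3.0 * income + 20.0  # perfectly linear and decreasing

r_up = pearson_r(income, rising)
r_down = pearson_r(income, falling)

# A noisy relationship gives a coefficient strictly between -1 and 1.
rng = np.random.default_rng(0)
noisy = rising + rng.normal(0.0, 2.0, size=income.size)
r_noisy = pearson_r(income, noisy)
r_numpy = np.corrcoef(income, noisy)[0, 1]

print(r_up, r_down, r_noisy)
```

The hand-written value should agree with `np.corrcoef`, the same kind of coefficient that `corr()` reports by default.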

In [None]:
housing.plot(kind="scatter", x="median_income", y="median_house_value",
             alpha=0.1)
plt.show()

> You have now seen correlations between individual features and the target. Sometimes, however, an attribute has little correlation with the target on its own, while a combination of two or more attributes does have an impact on it. So let's look for such combinations:

In [None]:
# EXPERIMENTING WITH ATTRIBUTE COMBINATIONS
# the total number of rooms in a district is not very useful if you don’t know how many households there are.
# What you really want is the number of rooms per household.
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]

In [None]:
# Now let's look at the correlation matrix again (numerical columns only)
corr_matrix = housing.drop("ocean_proximity", axis=1).corr()
corr_matrix['median_house_value'].sort_values(ascending=False)

The new bedrooms_per_room attribute is much more correlated with
the median house value than the total number of rooms or bedrooms. Apparently
houses with a lower bedroom/room ratio tend to be more expensive. The number of
rooms per household is also more informative than the total number of rooms in a
district—obviously the larger the houses, the more expensive they are.
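
The effect of a ratio feature can be reproduced on toy data. The sketch below is built so that, by construction, the target depends on rooms per household rather than on the raw room count. All numbers are synthetic assumptions, so it only illustrates the mechanism and says nothing about the real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic districts: household counts and average rooms per household.
households = rng.integers(100, 2000, size=n)
rooms_per_household = rng.uniform(3.0, 8.0, size=n)
total_rooms = households * rooms_per_household

# By construction, the target depends on the ratio, plus some noise.
house_value = 50_000 * rooms_per_household + rng.normal(0, 20_000, size=n)

# The raw total mixes district size with house size, so it correlates less.
corr_total = np.corrcoef(total_rooms, house_value)[0, 1]
corr_ratio = np.corrcoef(total_rooms / households, house_value)[0, 1]
print(f"total_rooms vs value:         {corr_total:.2f}")
print(f"rooms_per_household vs value: {corr_ratio:.2f}")
```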

In [None]:
housing.plot(kind="scatter", x="rooms_per_household", y="median_house_value",
             alpha=0.2)
plt.axis([0, 5, 0, 520000])
plt.show()

In [None]:
housing.describe()

# Prepare the data for Machine Learning algorithms

In [None]:
housing = strat_train_set.drop("median_house_value", axis=1) # drop labels for training set
housing_labels = strat_train_set["median_house_value"].copy()

You should compute the median value on the training set and
use it to fill the missing values in the training set. Don't forget to save the
median value that you have computed: you will need it later to replace missing values
in the test set when you want to evaluate your system, and also once the system goes
live to replace missing values in new data.
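
As a minimal sketch of this idea, the example below fits a `SimpleImputer` on a tiny made-up training column and then reuses the stored median on a made-up test column. The arrays are hypothetical and not part of the housing data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up single-column data with missing values (NaN).
train = np.array([[1.0], [2.0], [np.nan], [10.0]])
test = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy="median")
imputer.fit(train)  # learns the median from the training data only

train_filled = imputer.transform(train)
test_filled = imputer.transform(test)  # reuses the training median

learned_median = imputer.statistics_[0]
print("median learned on the training set:", learned_median)
print("test set after imputation:", test_filled.ravel())
```

The missing test value is filled with the training median, not with a statistic computed from the test set itself.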

In [None]:
# DATA CLEANING
# We will fill the missing numerical values with the median of each attribute.
# Scikit-Learn provides a handy class to take care of missing values: SimpleImputer
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')

In [None]:
# The median can only be computed on numerical attributes, so drop the text attribute.
housing_num = housing.drop('ocean_proximity', axis=1)

In [None]:
imputer.fit(housing_num)

In [None]:
imputer.statistics_

In [None]:
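# Check that these match the medians computed directly with pandas
housing_num.median().values

In [None]:
# Replace the missing values with the learned medians (returns a NumPy array)
X = imputer.transform(housing_num)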

In [None]:
# HANDLING CATEGORICAL ATTRIBUTES
housing_cat = housing[["ocean_proximity"]]
housing_cat.head(10)

In [None]:
# By default, the OneHotEncoder class returns a sparse matrix; calling its toarray() method converts it to a dense array.
# (Alternatively, pass sparse=False to the constructor in older scikit-learn releases, or sparse_output=False from version 1.2 on.)
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat).toarray()
housing_cat_1hot

Although Scikit-Learn provides many useful transformers, you will sometimes need to write
your own, for example for custom cleanup operations or for combining specific attributes.

Let's create a custom transformer to add extra attributes:

In [None]:
#CUSTOM TRANSFORMATIONS
from sklearn.base import BaseEstimator, TransformerMixin

# column index
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True): # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)

In [None]:
# TRANSFORMATION PIPELINES
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

housing_num_tr = num_pipeline.fit_transform(housing_num)
# The pipeline exposes the same methods as the final estimator. In this example, the last estimator is a StandardScaler,
# which is a transformer, so the pipeline has a transform() method that applies all the transforms to the data in sequence 
#(and of course also a fit_transform() method, which is the one we used).
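
To see what "in sequence" means here, this small sketch (on made-up numbers, and using only the imputer and scaler steps for brevity) checks that calling fit_transform() on a pipeline gives the same result as applying each step by hand.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up numerical data with a couple of missing values.
X_toy = np.array([[1.0, 10.0],
                  [2.0, np.nan],
                  [3.0, 30.0],
                  [np.nan, 40.0]])

# Two-step pipeline: impute, then scale.
toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
via_pipeline = toy_pipeline.fit_transform(X_toy)

# The same two steps applied manually, one after the other.
imputed = SimpleImputer(strategy="median").fit_transform(X_toy)
by_hand = StandardScaler().fit_transform(imputed)

print(np.allclose(via_pipeline, by_hand))
```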

In [None]:
# we have handled the categorical columns and the numerical columns separately. It would be more convenient to have a single transformer able to 
# handle all columns, applying the appropriate transformations to each column.
from sklearn.compose import ColumnTransformer
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])

housing_prepared = full_pipeline.fit_transform(housing)

In [None]:
housing_prepared
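
The result above is a plain array, so it can be hard to tell which columns came from where. As a rough sketch on a tiny made-up DataFrame (hypothetical column names and values), the example below shows how a ColumnTransformer concatenates the scaled numerical columns with the one-hot columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data: two numerical columns and one categorical column.
toy = pd.DataFrame({
    "rooms": [3.0, 5.0, 4.0, 6.0],
    "income": [1.5, 3.2, 2.8, 5.0],
    "proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"],
})

toy_transformer = ColumnTransformer([
    ("num", StandardScaler(), ["rooms", "income"]),
    ("cat", OneHotEncoder(), ["proximity"]),
])

prepared = toy_transformer.fit_transform(toy)
# Depending on how sparse the result is, a sparse matrix may be returned.
if hasattr(prepared, "toarray"):
    prepared = prepared.toarray()

# 2 scaled numerical columns + 3 one-hot columns (one per category).
print(prepared.shape)
```

The numerical columns come first and the one-hot columns follow, in the order the transformers were listed.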

# This was the first part of the notebook. STAY TUNED FOR THE NEXT ONE.

**If you have any questions, please post them in the comments, and please upvote if you found this informative.**

Credits: Hands-On Machine Learning (book)