In [1]:
import pandas as pd
import numpy as np

# Read Data

In [2]:
housing = pd.read_csv('housing.csv')

## Data exploring

In [None]:
housing.head()

In [None]:
housing.info()

In [None]:
housing.describe() 

In [None]:
print(housing["ocean_proximity"].value_counts())

## Splitting the data

### train_test_split
Randomly split the data

In [3]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state = 42) 

# keeping 20% for the test set

### stratified sampling
The population is divided into homogeneous subgroups called strata, and the right number of instances is sampled from each stratum so that the test set is representative of the overall population.

In [None]:
# page: 57
from sklearn.model_selection import StratifiedShuffleSplit

housing["income_cat"] = pd.cut(housing["median_income"],
bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
labels=[1, 2, 3, 4, 5])

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42) 
for train_index, test_index in split.split(housing, housing["income_cat"]):
    # split() yields positional indices, so iloc is the safe accessor
    strat_train_set = housing.iloc[train_index]
    strat_test_set = housing.iloc[test_index]
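
To see what stratification buys, one can compare the share of each income category in the full data with its share in the test set. The sketch below is only an illustration: the `demo` frame is synthetic (a log-normal stand-in for `median_income`, not the housing data), and the bin edges are copied from the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for median_income (assumption: not the real housing data)
rng = np.random.default_rng(42)
demo = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=2000)})
demo["income_cat"] = pd.cut(demo["median_income"],
                            bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                            labels=[1, 2, 3, 4, 5])

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_index, test_index = next(split.split(demo, demo["income_cat"]))
demo_test = demo.iloc[test_index]

# Share of each income category overall vs. in the stratified test set
overall = demo["income_cat"].value_counts(normalize=True).sort_index()
in_test = demo_test["income_cat"].value_counts(normalize=True).sort_index()
comparison = pd.DataFrame({"overall": overall, "stratified_test": in_test})
print(comparison)
```

The two columns should agree closely, which is the property a purely random split does not guarantee on small or skewed data.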

# Data cleaning

Remove unnecessary columns

In [None]:
# "column_name" is a placeholder: list the columns you actually want to drop
housing = housing.drop(["column_name"], axis=1)

Separate numerical and categorical data

In [26]:
housing_cat = housing[["ocean_proximity"]]
housing_target = housing["median_house_value"]
housing_num = housing.drop(["ocean_proximity","median_house_value"], axis=1)

In [None]:
print(housing_num.head())
print(housing_cat.head())
print(housing_target.head())

## Handling missing values

Manually filling missing values in a column with the median value

In [None]:
# assign the result back instead of using inplace=True on a column selection
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(housing["total_bedrooms"].median())

### SimpleImputer 
- Scikit-Learn provides a handy class to take care of missing values: SimpleImputer.
- Create a SimpleImputer instance, specifying that we want to replace each attribute’s missing values with the median of that attribute.
- The median can only be computed on numerical attributes, so we need a copy of the data without the text attribute (the categorical data).

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")

- Now we can fit the imputer instance to the training data using the fit() method.
- We cannot be sure that there won’t be any missing values in new data after the system goes live, so it is safer to apply the imputer to all the numerical attributes.

In [None]:
imputer.fit(housing_num)

The imputer has simply computed the median of each attribute and stored the result in its statistics_ instance variable.

In [None]:
imputer.statistics_
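
As a sanity check on what `statistics_` holds, the sketch below fits an imputer on a tiny made-up frame (`toy` and its values are hypothetical, not taken from the housing data) and compares the learned values with the medians pandas computes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny illustrative frame (hypothetical values) with one missing entry per column
toy = pd.DataFrame({"rooms": [3.0, np.nan, 5.0, 8.0],
                    "age": [10.0, 20.0, np.nan, 40.0]})

toy_imputer = SimpleImputer(strategy="median")
toy_imputer.fit(toy)

print(toy_imputer.statistics_)   # medians learned by the imputer
print(toy.median().values)       # medians computed by pandas, NaNs skipped
```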

- Now we can use this “trained” imputer to transform the training set by replacing missing values with the learned medians.

In [None]:
X = imputer.transform(housing_num)

- The result is a plain NumPy array containing the transformed features. To put it back into a Pandas DataFrame use pd.DataFrame

In [None]:
housing_tr = pd.DataFrame(X, columns=housing_num.columns)

The imputer is a Scikit-Learn estimator, so it has a fit() method.

It is also a transformer, so it has a transform() method.

Transformers also provide fit_transform(), which fits and transforms the data in a single call.
**fit_transform()** is equivalent to calling fit() and then transform(), and is sometimes optimized to run faster.

In [None]:
X = imputer.fit_transform(housing_num)
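
The equivalence between the two-step and one-step forms can be checked on a small array. This is a minimal sketch; `X_toy` is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative array with one missing value per column
X_toy = np.array([[1.0, np.nan],
                  [3.0, 4.0],
                  [np.nan, 8.0]])

# Two-step version: learn the medians, then apply them
two_step = SimpleImputer(strategy="median").fit(X_toy).transform(X_toy)
# One-step version
one_step = SimpleImputer(strategy="median").fit_transform(X_toy)

print(two_step)
print(np.array_equal(two_step, one_step))
```

On new data (for example a test set) only transform() should be called, so that the medians learned from the training set are reused.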

## Handling categorical data

- Ordinal encoding
- One-hot encoding

In [None]:
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
housing_cat_encoded[:10]

In [None]:
ordinal_encoder.categories_

In [None]:
from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)

In [None]:
housing_cat_1hot.toarray()

In [None]:
housing_cat_1hot_df = pd.DataFrame(housing_cat_1hot.toarray(), columns=cat_encoder.get_feature_names_out(["ocean_proximity"]))

In [None]:
housing_cat_1hot_df.describe()

### Series vs DataFrame: `df["column"]` vs `df[["column"]]`

In pandas, these two indexing forms return different types by design.

When you use single square brackets `df["column"]`, you are indexing a single column, and the result is a Pandas Series. A Series is essentially a one-dimensional labeled array, and it retains the index of the original DataFrame.

On the other hand, when you use double square brackets `df[["column"]]`, you are indexing with a list of column names, even if there is only one column in the list. This syntax is designed to return a DataFrame with the specified column(s). The result is a DataFrame with one or more columns, and it retains the DataFrame structure.

Here's a simple example to illustrate:

```python
import pandas as pd

# Creating a DataFrame
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

# Single square brackets return a Series
series_result = df['A']
print(type(series_result))  # <class 'pandas.core.series.Series'>

# Double square brackets return a DataFrame
df_result = df[['A']]
print(type(df_result))  # <class 'pandas.core.frame.DataFrame'>
```

The distinction matters in this notebook because Scikit-Learn encoders expect 2-D input, which is why `housing_cat` is built with double brackets.
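
The following sketch shows the practical consequence for encoders. The small `df` frame is illustrative only; it reuses the `ocean_proximity` column name with made-up values.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND"]})

as_series = df["ocean_proximity"]     # 1-D, shape (3,)
as_frame = df[["ocean_proximity"]]    # 2-D, shape (3, 1)
print(as_series.shape, as_frame.shape)

encoded = OrdinalEncoder().fit_transform(as_frame)  # works: 2-D input
print(encoded.ravel())

try:
    OrdinalEncoder().fit_transform(as_series)       # 1-D input is rejected
    series_rejected = False
except ValueError as err:
    series_rejected = True
    print("ValueError:", str(err).splitlines()[0])
```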

# Feature Scaling

Machine Learning algorithms don’t perform well when the input numerical attributes have very different scales. 
Feature scaling is one of the most important transformations needed to apply to the data.
- min-max scaling
- standardization
  
## min-max scaling (normalization)
- values are shifted and rescaled so that they end up ranging from 0 to 1.
- subtract the min value and divide by the max minus the min.
$$x_i' = \frac{x_i - \text{min}}{\text{max} - \text{min}}$$
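
A minimal sketch of the formula, checked against Scikit-Learn's `MinMaxScaler`. The input values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[10.0], [20.0], [25.0], [50.0]])  # illustrative values

# Formula applied by hand: subtract the min, divide by (max - min)
manual = (x - x.min()) / (x.max() - x.min())
# Same transformation through the Scikit-Learn transformer
scaled = MinMaxScaler().fit_transform(x)

print(manual.ravel())
print(scaled.ravel())
```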

## Standardization
- first subtract the mean value (so standardized values always have a zero mean)
- then divide by the standard deviation (so that the resulting distribution has unit variance).
- Unlike min-max scaling, standardization does not bound values to a specific range.
- standardization is much less affected by outliers
$$x_i' = \frac{x_i - \text{mean}}{\text{standard deviation}}$$
- Scikit-Learn provides a transformer for this: **StandardScaler**
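
A minimal sketch of standardization by hand and with `StandardScaler`, on made-up values. Note that `StandardScaler` divides by the population standard deviation (NumPy's default, `ddof=0`), whereas the pandas `.std()` default is the sample standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[10.0], [20.0], [25.0], [50.0]])  # illustrative values

# Formula applied by hand: subtract the mean, divide by the standard deviation
manual = (x - x.mean()) / x.std()   # np.std uses ddof=0 by default
scaled = StandardScaler().fit_transform(x)

print(scaled.ravel())
print(scaled.mean(), scaled.std())  # approximately 0 and 1
```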

# Transformation Pipeline
- The Pipeline constructor takes a list of name/estimator pairs defining a sequence of steps. 
- All but the last estimator must be transformers (i.e., they must have a fit_transform() method). 
- The names can be anything you like (as long as they are unique and don’t contain double underscores “__”)
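
The double underscore is reserved because Scikit-Learn uses `<step name>__<parameter>` to address a parameter of a single step. A short sketch, assuming a pipeline with the same two steps as the one built in this section (`demo_pipeline` is a name introduced here for illustration):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

demo_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# "<step name>__<parameter>" targets one parameter of one step
demo_pipeline.set_params(imputer__strategy="mean")

print(demo_pipeline.named_steps["imputer"].strategy)
print(demo_pipeline.get_params()["std_scaler__with_mean"])
```

The same naming scheme is what hyperparameter search tools use to tune parameters inside a pipeline.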


In [7]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# pipeline to handle numerical values
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
])

# transform housing_num: fill missing values with the median, then standardize
housing_num_tr = num_pipeline.fit_transform(housing_num)

In [None]:
housing_num_tr_df = pd.DataFrame(housing_num_tr, columns=housing_num.columns)

## Transform categorical data and numerical data in single transformer

The constructor requires a list of tuples, where each tuple contains:
- a name,
- a transformer,
- and a list of names (or indices) of columns that the transformer should be applied to.

In [8]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

num_attribs = list(housing_num)
cat_attribs = list(housing_cat)

full_pipeline = ColumnTransformer([ 
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_attribs), 
]) 
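# fit_transform() below returns a matrix (dense NumPy array or SciPy sparse matrix)
# handle_unknown determines what to do when transform() meets a category not seen during fit.
# This may happen if the test set contains categories that are absent from the training set;
# with "ignore", such rows get all zeros in the one-hot columns instead of raising an error.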

housing_prepared = full_pipeline.fit_transform(housing)
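
To make the `handle_unknown="ignore"` setting concrete, here is a minimal sketch on made-up category values (`train_cat` and `new_cat` are illustrative frames, not part of the housing data):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_cat = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "NEAR OCEAN"]})
new_cat = pd.DataFrame({"ocean_proximity": ["INLAND", "ISLAND"]})  # "ISLAND" was never seen

# With "ignore", the unseen category becomes a row of zeros
lenient = OneHotEncoder(handle_unknown="ignore").fit(train_cat)
encoded_new = lenient.transform(new_cat).toarray()
print(encoded_new)

# With the default ("error"), the same input raises a ValueError
strict = OneHotEncoder().fit(train_cat)
try:
    strict.transform(new_cat)
    strict_raised = False
except ValueError:
    strict_raised = True
print(strict_raised)
```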

Note that the OneHotEncoder returns a sparse matrix, while the num_pipeline returns a dense matrix. When there is such a mix of sparse and dense matrices, the ColumnTransformer estimates the density of the final matrix (i.e., the ratio of non-zero cells), and it returns a sparse matrix if the density is lower than a given threshold (by default, sparse_threshold=0.3).
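
The threshold behavior can be observed on synthetic data. In this sketch the helper `build_and_transform` and its column names are made up for illustration; it combines one dense numeric column with a one-hot encoded column and varies the number of categories so that the estimated density falls on either side of 0.3.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_and_transform(n_categories, n_rows=40):
    """Fit a ColumnTransformer on one numeric and one categorical column."""
    frame = pd.DataFrame({
        "num": np.arange(n_rows, dtype=float),
        "cat": [f"c{i % n_categories}" for i in range(n_rows)],
    })
    ct = ColumnTransformer([
        ("num", StandardScaler(), ["num"]),   # dense output
        ("cat", OneHotEncoder(), ["cat"]),    # sparse output
    ])
    return ct.fit_transform(frame)

few = build_and_transform(n_categories=2)    # 3 columns, estimated density about 2/3
many = build_and_transform(n_categories=10)  # 11 columns, estimated density about 2/11

print(few.shape, sparse.issparse(few))
print(many.shape, sparse.issparse(many))
```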

# Training a model and evaluating on the training set

### Linear Regression

In [9]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_target)

In [10]:
some_data = housing.iloc[:5]
some_target = housing_target.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)

print(lin_reg.predict(some_data_prepared))
print(some_target)

[[408504.]
 [424036.]
 [378476.]
 [321124.]
 [255856.]]
   median_house_value
0            452600.0
1            358500.0
2            352100.0
3            341300.0
4            342200.0


### Root mean squared error

In [12]:
from sklearn.metrics import mean_squared_error as mse
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mse(housing_target, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

68709.30331593033

#### python function to calculate root mean squared error

In [None]:
import math
def linear(w,b,x):
    return w*x + b

def squared_error(w,b,x,y_real):
    return pow(( y_real - linear(w,b,x)), 2 )

def rmse(w,b,X_array, Y_array):
    count = len(X_array)
    total = 0  # named total to avoid shadowing the built-in sum()
    for i in range(count):
        total += squared_error(w,b,X_array[i],Y_array[i])
    mean_squared_error = total/count
    root_mean_squared_error = math.sqrt(mean_squared_error)
    
    return root_mean_squared_error

w=1
b=2
x = [10,20]
y = [112,210]

print(rmse(w,b,x,y))
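
For comparison, the same calculation can be written without an explicit loop. This is a sketch, with `rmse_vectorized` a name introduced here; it uses the same inputs as the example above, where the residuals are 100 and 188.

```python
import numpy as np

def rmse_vectorized(w, b, x, y):
    """RMSE of the line y = w*x + b, computed with NumPy arrays."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    residuals = y - (w * x + b)
    return float(np.sqrt(np.mean(residuals ** 2)))

# Same inputs as the loop-based example
result = rmse_vectorized(1, 2, [10, 20], [112, 210])
print(result)
```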

### Decision Tree Regression

In [19]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_target)

In [15]:
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mse(housing_target, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

0.0

### Evaluation using train test split

- Split the data into a training set and a testing set.
- Train the model on the training set and test it on the testing set.

In [21]:
from sklearn.model_selection import train_test_split

train_set_features, test_set_features, train_set_target, test_set_target = train_test_split(housing_prepared, housing_target, test_size=0.2, random_state = 42)

tree_reg = DecisionTreeRegressor()
tree_reg.fit(train_set_features, train_set_target)

def rmse(model, test_features, test_target):
    predictions = model.predict(test_features)
    model_mse = mse(test_target, predictions)
    model_rmse = np.sqrt(model_mse)
    return model_rmse

rmse(tree_reg, test_set_features, test_set_target)

69055.08189486346

### Cross validation

**Scikit-Learn’s K-fold cross-validation:** The following code splits the training set into 10 distinct subsets called folds, then it trains and evaluates the model 10 times, picking a different fold for evaluation every time and training on the other 9 folds. The result is an array containing the 10 evaluation scores.

- Scikit-Learn’s cross-validation features expect a utility function (greater is better) rather than a cost function (lower is better), so the scoring function is actually the opposite of the MSE (i.e., a negative value), which is why the code below computes -scores before calculating the square root.


In [33]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg, housing_prepared, housing_target,
                        scoring="neg_mean_squared_error", cv=10)

tree_rmse_scores = np.sqrt(-scores)

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

display_scores(tree_rmse_scores)
    

Scores: [137059.67894404  69535.36810686 103453.23430445  75031.00104527
  87203.79266311  84766.16955271  64725.67215609 108161.58665932
 107990.9335493   74734.45393694]
Mean: 91266.18909181032
Standard deviation: 21407.942116839116
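
To show what `cross_val_score` does under the hood, the sketch below reproduces it with an explicit `KFold` loop. It runs on synthetic data from `make_regression` with a `LinearRegression` model (both are assumptions chosen to keep the example small and deterministic, not the housing setup above).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem (assumption: stands in for housing_prepared)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

kfold = KFold(n_splits=5)
manual_rmse = []
for train_idx, val_idx in kfold.split(X_demo):
    # Train on 4 folds, evaluate on the held-out fold
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    fold_mse = mean_squared_error(y_demo[val_idx], model.predict(X_demo[val_idx]))
    manual_rmse.append(np.sqrt(fold_mse))
manual_rmse = np.array(manual_rmse)

# The library call returns negated MSE values, hence the sign flip
neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo,
                          scoring="neg_mean_squared_error", cv=kfold)
library_rmse = np.sqrt(-neg_mse)

print(manual_rmse)
print(np.allclose(manual_rmse, library_rmse))
```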


### Random Forest Regression

Random Forests work by training many Decision Trees on random subsets of the training instances (and, optionally, random subsets of the features at each split), then averaging out their predictions. Building a model on top of many other models is called **Ensemble Learning**, and it is often a great way to push ML algorithms even further.

In [27]:
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_target)

In [34]:
rf_scores = cross_val_score(forest_reg, housing_prepared, housing_target, 
                           scoring="neg_mean_squared_error", cv=10)
rf_rmse_scores = np.sqrt(-rf_scores)

rmse(forest_reg, housing_prepared, housing_target)  # training-set RMSE (return value is not displayed)
display_scores(rf_rmse_scores)

Scores: [107238.55240878  49057.6342153   68544.86827284  60191.55765018
  61912.18059241  66470.64024908  48762.23824707  85538.93641659
  81149.61186529  54851.54480853]
Mean: 68371.77647260725
Standard deviation: 17381.783892833573
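
The averaging step of a Random Forest can be verified directly: the forest's prediction should match the mean of its individual trees' predictions. A small sketch on synthetic data (the dataset and the `n_estimators=20` setting are assumptions made for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem (assumption: stands in for housing_prepared)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

forest = RandomForestRegressor(n_estimators=20, random_state=42).fit(X_demo, y_demo)

# One row of predictions per fitted tree, then average across trees
per_tree = np.stack([tree.predict(X_demo[:5]) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

print(per_tree.shape)
print(np.allclose(averaged, forest.predict(X_demo[:5])))
```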


# Remaining topics

- find correlations
- fine tuning
   - grid search
   - randomized search
