## Regression

Columns:
- <b> Age: </b> Age in years
- <b> KM: </b> Accumulated Kilometers on odometer
- <b> FuelType: </b> Fuel Type (Petrol, Diesel, CNG)
- <b> HP: </b> Horse Power
- <b> MetColor: </b> Metallic Color? (Yes=1, No=0)
- <b> Automatic: </b> Automatic transmission? (Yes=1, No=0)
- <b> CC: </b> Cylinder Volume in cubic centimeters
- <b> Doors: </b> Number of doors
- <b> Weight: </b> Weight in Kilograms
- <b> Price: </b> Offer Price in EUROs

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from collections import Counter
from IPython.display import display, HTML
sns.set_style('darkgrid')

In [None]:
# Reading the ToyotaCorolla dataset from a CSV file into a pandas DataFrame
df = pd.read_csv('./data/ToyotaCorolla.csv')

# Displaying the first few rows of the DataFrame
df.head()


In [None]:
df.count()

In [None]:
df.describe()

## Data Preprocessing and Visualization

In [None]:
# Checking for missing/null values in the DataFrame "df"
# Using the `isnull()` method on the DataFrame, which returns a boolean DataFrame
# Summing the number of True values (missing/null) for each column using the `sum()` method
df.isnull().sum()


In [None]:
# Correlation matrix of the numeric columns only; FuelType holds strings,
# and recent pandas versions raise an error if it is passed to corr()
corr = df.corr(numeric_only=True)
# Plot figsize
fig, ax = plt.subplots(figsize=(8, 8))
# Generate the heat map with annotated cells; seaborn labels the ticks
# with the column names and centers them on the cells
sns.heatmap(corr, cmap='magma', annot=True, fmt=".2f", ax=ax)
# Show plot
plt.show()

In [None]:
f, axes = plt.subplots(2, 2, figsize=(12,8))

sns.regplot(x = 'Price', y = 'Age', data = df, scatter_kws={'alpha':0.6}, ax = axes[0,0])
axes[0,0].set_xlabel('Price', fontsize=14)
axes[0,0].set_ylabel('Age', fontsize=14)
axes[0,0].yaxis.tick_left()

sns.regplot(x = 'Price', y = 'KM', data = df, scatter_kws={'alpha':0.6}, ax = axes[0,1])
axes[0,1].set_xlabel('Price', fontsize=14)
axes[0,1].set_ylabel('KM', fontsize=14)
axes[0,1].yaxis.set_label_position("right")
axes[0,1].yaxis.tick_right()

sns.regplot(x = 'Price', y = 'Weight', data = df, scatter_kws={'alpha':0.6}, ax = axes[1,0])
axes[1,0].set_xlabel('Price', fontsize=14)
axes[1,0].set_ylabel('Weight', fontsize=14)

sns.regplot(x = 'Price', y = 'HP', data = df, scatter_kws={'alpha':0.6}, ax = axes[1,1])
axes[1,1].set_xlabel('Price', fontsize=14)
axes[1,1].set_ylabel('HP', fontsize=14)
axes[1,1].yaxis.set_label_position("right")
axes[1,1].yaxis.tick_right()
axes[1,1].set(ylim=(40,160))

plt.show()

In [None]:
f, axes = plt.subplots(1,2,figsize=(14,4))

# histplot replaces the deprecated distplot; kde=True overlays a density curve
sns.histplot(df['KM'], kde=True, ax = axes[0])
axes[0].set_xlabel('KM', fontsize=14)
axes[0].set_ylabel('Count', fontsize=14)
axes[0].yaxis.tick_left()

sns.scatterplot(x = 'Price', y = 'KM', data = df, ax = axes[1])
axes[1].set_xlabel('Price', fontsize=14)
axes[1].set_ylabel('KM', fontsize=14)
axes[1].yaxis.set_label_position("right")
axes[1].yaxis.tick_right()

plt.show()

In [None]:
df.head()

In [None]:
# Encoding categorical variables using one-hot encoding
# The `get_dummies()` function from the pandas library is used to convert categorical variables into binary columns
# The function takes a DataFrame `df` as input and returns a new DataFrame with one-hot encoded columns
df = pd.get_dummies(df)
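
To make the effect of one-hot encoding concrete, here is a minimal sketch on a toy frame. The values are made up for illustration; only the column names `KM` and `FuelType` mirror the dataset.

```python
import pandas as pd

# Toy frame with hypothetical values (an assumption, not rows from ToyotaCorolla.csv)
toy = pd.DataFrame({
    "KM": [46986, 72937, 41711],
    "FuelType": ["Diesel", "Petrol", "CNG"],
})

# Numeric columns pass through unchanged; each distinct FuelType value
# becomes its own indicator column named FuelType_<value>
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())

# Every row has exactly one active FuelType indicator
fuel_cols = [c for c in encoded.columns if c.startswith("FuelType_")]
print(encoded[fuel_cols].sum(axis=1).tolist())
```

Recent pandas versions emit the indicator columns as booleans; passing `dtype=int` to `get_dummies` is one way to keep the whole frame numeric if a later step expects that.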


In [None]:
df.head()

In [None]:
# Splitting the data into features (X) and target variable (y)
# The 'Price' column is dropped from the DataFrame to create the feature matrix X
X = df.drop('Price', axis=1).values

# The target variable is assigned to a separate variable y
# 'Price' is selected by name rather than by position, so the split does not depend on column order
# The values are then reshaped to have a single column using reshape(-1, 1)
y = df['Price'].values.reshape(-1, 1)


In [None]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)

In [None]:
print("Shape of X_train: ",X_train.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of y_train: ",y_train.shape)
print("Shape of y_test: ",y_test.shape)

## Regression Models

### Linear Regression

In [None]:
# Importing the LinearRegression class from the scikit-learn library
from sklearn.linear_model import LinearRegression

# Creating an instance of the LinearRegression class
regressor_linear = LinearRegression()

# Training the linear regression model on the training data
# The fit() method is used to train the model by fitting it to the feature matrix X_train and the target variable y_train
regressor_linear.fit(X_train, y_train)

In [None]:
print(regressor_linear.score(X_train, y_train))
print(regressor_linear.score(X_test, y_test))

Cross-validation is a widely used technique to assess the generalization performance of regression models (or other predictive models). It helps in understanding how the results of a statistical analysis will generalize to an independent data set. Here's an explanation of how it works:

1. Partition the Dataset
The dataset is divided into 'k' roughly equal-sized 'folds' or 'subsets'. A common choice is k=10, known as 10-fold cross-validation.

2. Training and Validation Process
For each of the 'k' folds:
- Training: the other (k-1) folds are combined to form a training set, and the model is trained on this combined data.
- Validation: the remaining fold (the left-out fold) is used as a validation set to test the model.
- Performance Metric: the metric (such as Mean Squared Error for regression) is computed on the validation fold for this iteration.

3. Repeat the Process
This process is repeated k times, with each of the k subsets serving exactly once as the validation data.

4. Average the Errors
The k results from the folds are then averaged to produce a single estimate of performance. Because every model is fitted on most of the data, the estimate is only slightly pessimistic, and because every observation is used for validation exactly once, it depends less on one lucky or unlucky split.

5. Assess the Model
The average error is used to assess the model's quality. It is usually a more reliable estimate of how well the model will perform on unseen data than a single train-test split.

6. Final Model Training
Once the cross-validation process is complete and the best hyperparameters are selected, the final model is usually trained on the entire dataset before making predictions on new/unseen data.
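
The six steps above can be sketched by hand with scikit-learn's `KFold`. This is an illustration under stated assumptions: `X_demo` and `y_demo` are synthetic stand-ins rather than the ToyotaCorolla data, and the fold count and seeds are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic stand-in data (an assumption, not the ToyotaCorolla dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Step 1: partition the rows into k = 10 folds
kf = KFold(n_splits=10, shuffle=True, random_state=0)

# Steps 2 and 3: fit on k-1 folds, score on the held-out fold, repeat k times
fold_mse = []
for train_idx, val_idx in kf.split(X_demo):
    model = LinearRegression()
    model.fit(X_demo[train_idx], y_demo[train_idx])
    pred = model.predict(X_demo[val_idx])
    fold_mse.append(mean_squared_error(y_demo[val_idx], pred))

# Steps 4 and 5: average the per-fold errors into a single estimate
mean_mse = float(np.mean(fold_mse))
print("Per-fold MSE:", np.round(fold_mse, 4))
print("Mean MSE:", round(mean_mse, 4))

# Step 6: refit on all available data for later predictions
final_model = LinearRegression().fit(X_demo, y_demo)
```

A new `LinearRegression` is created inside the loop so that no fitted state carries over from one fold to the next.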

Use in Hyperparameter Tuning
Cross-validation can also be used in conjunction with grid search or other search algorithms to find the optimal hyperparameters for the model.
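
As a hedged illustration of that idea, the sketch below tunes the `alpha` penalty of a `Ridge` regressor with `GridSearchCV`. `Ridge`, the candidate values, and the synthetic data are assumptions chosen for the example; plain `LinearRegression` has no comparable hyperparameter to search over.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption, not the ToyotaCorolla dataset)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(150, 4))
y_demo = X_demo @ np.array([1.5, 0.0, -2.0, 0.7]) + rng.normal(scale=0.2, size=150)

# Candidate hyperparameter values (arbitrary choices for illustration)
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}

# Each candidate is scored with 5-fold cross-validation; scikit-learn
# maximizes scores, so MSE is expressed as its negative
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_demo, y_demo)

print("Best alpha:", search.best_params_["alpha"])
print("Best CV MSE:", -search.best_score_)
```

By default `GridSearchCV` refits the best configuration on all the data passed to `fit`, and the result is available as `search.best_estimator_`.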

Advantages and Disadvantages
Advantages: More reliable estimate of out-of-sample performance compared to train-test split.
Disadvantages: Computationally more expensive as it requires fitting and predicting k times.

In [None]:
# Importing the necessary library or module for cross-validation
from sklearn.model_selection import cross_val_score

# Estimating generalization performance with 10-fold cross-validation on the training set
# The cross_val_score() function is used to evaluate the performance of the model using cross-validation
# The estimator parameter takes the regression model 'regressor_linear'; a fresh clone of it is fitted on each fold
# The X parameter represents the feature matrix X_train
# The y parameter represents the target variable y_train
# The cv parameter specifies the number of folds or subsets to be created for cross-validation (in this case, 10)
# With no scoring argument, a regressor is scored by R^2, so higher is better
cv_linear = cross_val_score(estimator=regressor_linear, X=X_train, y=y_train, cv=10)

# Printing the mean of the cross-validation scores
print("CV: ", cv_linear.mean())
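
The score printed above is an R² value, not an error. If an error metric such as the Mean Squared Error mentioned earlier is wanted, `cross_val_score` accepts a `scoring` argument. The sketch below runs on synthetic stand-in data (an assumption), so its numbers say nothing about the car-price model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption, not the ToyotaCorolla dataset)
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(120, 2))
y_demo = 3.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.5, size=120)

# Default scoring for a regressor is R^2 (higher is better)
r2_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=10)

# scikit-learn maximizes scores, so MSE is returned negated
neg_mse = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=10, scoring="neg_mean_squared_error"
)
rmse = np.sqrt(-neg_mse)  # per-fold root mean squared error, in the units of y

print("Mean R^2: ", r2_scores.mean())
print("Mean RMSE:", rmse.mean())
```

RMSE is often easier to read than R² because it is in the units of the target; for the notebook's data that would be EUROs.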

##  Exercise: Use another Regression model