I am just starting out in data science and this is my first competition. I have tried exploring Ridge regression here. Any feedback or suggestions to improve my skills and score are very much appreciated.

# Importing data and libraries

In [None]:
## Importing Libraries and **Data files**

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
train = pd.read_csv('../input/tabular-playground-series-jul-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-jul-2021/test.csv')
sample = pd.read_csv('../input/tabular-playground-series-jul-2021/sample_submission.csv')

print("train shape", train.shape)
print("test shape", test.shape)

# Data Analysis

The data analysis will follow these steps:

* Check out categorical and numerical data
* Check for duplicate data
* Detect null values
* Explore the data with graphs and plots, for example:
  * Distribution
  * Correlation
  * Box plot to get an idea about the outliers
* Get the correlation between the independent and dependent variables
* Look at changes in the dependent and independent variables over time
* Use the graphs to help decide what kind of model is needed
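
As a rough illustration of the first three checks, here is a minimal sketch on a tiny made-up frame (the values and the variable names `df`, `categorical_cols`, `numerical_cols` are assumptions for the example, not the competition data):

```python
import pandas as pd

# Tiny made-up frame standing in for the real data (hypothetical values).
df = pd.DataFrame({
    "date_time": ["2010-03-10 18:00:00", "2010-03-10 19:00:00", "2010-03-10 18:00:00"],
    "deg_C": [13.1, 13.2, 13.1],
    "sensor_1": [1387.2, None, 1387.2],
})

# Numerical columns versus everything else (treated as categorical/text here).
numerical_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

n_duplicates = int(df.duplicated().sum())  # rows identical to an earlier row
null_counts = df.isnull().sum()            # missing values per column

print("categorical:", categorical_cols)
print("numerical:", numerical_cols)
print("duplicates:", n_duplicates)
print(null_counts)
```

The same calls can be pointed at `train` once it is loaded.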


## Extracting information

In [None]:
print("train shape", train.shape)
train.describe().T

In [None]:
# info() prints its summary directly and returns None, so call it on its own line
print('Data types of the columns \n')
train.info()
print('\n\n\nTotal Null values\n', train.isnull().sum())

In [None]:
print('Total Duplicate values\n', train.duplicated().sum())

### *Inference* 1: 
* No null values
* All columns except the time column are continuous variables of float type
* This is a regression problem on time-series data, and it will be handled accordingly.
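
The time column is stored as text. One possible way to make it usable, shown here only as a sketch on made-up timestamps, is to parse it with `pd.to_datetime` and derive simple calendar features. The column names `hour` and `day_of_week` are assumptions for illustration, and the models later in this notebook do not use them:

```python
import pandas as pd

# Made-up timestamps in the same text format as a typical date_time column.
df = pd.DataFrame({"date_time": ["2010-03-10 18:00:00", "2010-03-11 07:00:00"]})

# Parse the strings into real datetimes.
df["date_time"] = pd.to_datetime(df["date_time"])

# Derive simple calendar features (hypothetical names).
df["hour"] = df["date_time"].dt.hour
df["day_of_week"] = df["date_time"].dt.dayofweek  # Monday is 0

print(df)
```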


## Univariate DA 
- Histogram and box plot

In [None]:
hist = train.hist(figsize = (18, 10), bins=50, grid = False, xlabelsize=8, ylabelsize=8, layout = (3,4))

In [None]:
box = train.boxplot(figsize = (18,8), rot = 20 )

### *Inference* 2: 
* From the histograms: the features are roughly normally distributed, while the targets are mostly right-skewed.
* From the box plot: there are outliers in all the sensor data, so a model that copes well with outliers should be picked.
* From the box plot: scaling has to be performed, since the features are on very different scales with wide ranges.
* Because of the outliers, I'll use **Robust scaler** for scaling the features.
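
To show why the Robust scaler suits data with outliers, here is a minimal sketch on made-up values. With its default settings, `RobustScaler` subtracts the median and divides by the interquartile range, so a single extreme point barely affects the scaling of the rest:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One made-up feature with a single extreme value.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(x)

# The same transformation written out by hand: (x - median) / IQR.
median = np.median(x, axis=0)
q1, q3 = np.percentile(x, [25, 75], axis=0)
manual = (x - median) / (q3 - q1)

print(scaled.ravel())
print("matches manual computation:", np.allclose(scaled, manual))
```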

## Bivariate DA - 

* Comparing the 3 target features with each of the independent features

In [None]:
train.columns

In [None]:
sns.pairplot(train, height = 5,  # `size` was renamed to `height` in seaborn 0.9
    x_vars=['deg_C', 'relative_humidity', 'absolute_humidity','sensor_1', 'sensor_2', 'sensor_3', 'sensor_4', 'sensor_5'],
    y_vars=['target_carbon_monoxide', 'target_benzene', 'target_nitrogen_oxides'], 
)

In [None]:
plt.figure(figsize=(12,9))
# date_time is a text column, so leave it out of the correlation matrix
sns.heatmap(train.drop(columns='date_time').corr(), 
            annot=True, cmap="RdBu", fmt='.2f', 
            center = 0, linewidths=0.1, 
            cbar_kws={"shrink": .8}, square = True) 


### *Inference* 3: 
* deg_C and the humidity features are less correlated with the target variables, and also less correlated with the other independent variables
* Surprisingly, some variables are negatively correlated with the target variables
* There is multicollinearity among the variables, so for prediction I'll use **Ridge regression**
* I'll compare the model accuracy with and without scaling
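
To illustrate why multicollinearity points towards Ridge, here is a small sketch on synthetic data (not the competition data). Two nearly identical features make ordinary least squares coefficients unstable, while the Ridge penalty keeps them small. The seed, sizes and `alpha=1.0` are arbitrary choices for the example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
b = a + rng.normal(scale=0.01, size=n)  # almost a copy of a, so highly collinear
X = np.column_stack([a, b])
y = 3.0 * a + rng.normal(scale=0.5, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Ridge shrinks the coefficient vector relative to plain least squares.
print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
```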

# Model Plan


## Steps
* Remove the date-time column
* Train-test split before scaling to prevent data leakage
* Fit the Robust scaler on the training split and apply the same transformation to the test split
* Perform Ridge regression for each target
* Evaluate accuracy on the held-out split
* Final prediction
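
The steps above can be sketched compactly with a scikit-learn pipeline, which fits the scaler on the training split only and reuses it at prediction time. This is a sketch on synthetic data with 8 features, not the code used in the cells that follow, and the names `X_tr`, `X_te` and `model` are assumptions for the example:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the features and one target.
rng = np.random.default_rng(10)
X = rng.normal(size=(500, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.1, size=500)

# Split first, so nothing about the held-out rows leaks into the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=10)

# The pipeline fits RobustScaler on X_tr only, then applies it inside predict/score.
model = make_pipeline(RobustScaler(), Ridge(alpha=1.0))
model.fit(X_tr, y_tr)

r2 = model.score(X_te, y_te)  # R^2 on the held-out split
print("held-out R^2:", r2)
```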

### Data Preparation

In [None]:
m_train = train.copy()
m_train.drop(['date_time'], axis = 1, inplace = True)
m_train.columns

### Splitting and Scaling Data

In [None]:
# Separating the targets from the independent variables
x_data = m_train.drop(['target_carbon_monoxide', 'target_benzene', 'target_nitrogen_oxides'],axis=1)
y_data = m_train[['target_carbon_monoxide', 'target_benzene', 'target_nitrogen_oxides']]


# Splitting the data into train and test data 
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2, random_state=10)
print("X-train shape", x_train.shape, "\nY-test shape" ,y_test.shape)

In [None]:
# Robust Scaling

from sklearn.preprocessing import RobustScaler

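scaler = RobustScaler() # class object
# Fit the scaler on the training split only, then reuse its statistics on the test split
x_train = pd.DataFrame(scaler.fit_transform(x_train), columns = x_train.columns, index = x_train.index)
x_test = pd.DataFrame(scaler.transform(x_test), columns = x_test.columns, index = x_test.index)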


## Target Carbon-Monoxide Model Creation 

Ridge regression in scikit-learn also offered its own normalisation option (the `normalize` parameter, removed in scikit-learn 1.2), so I wanted to check how the model would be affected by a different scaling method.
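
Scaling matters for Ridge because the penalty acts on the size of the coefficients, which depends on the units of each feature. The sketch below, on synthetic data with an arbitrary `alpha=10.0`, changes only the unit of one feature: plain least squares gives the same fitted values, Ridge does not. The helper `max_prediction_change` is a name made up for this example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Same information, but feature 0 is expressed in a unit 1000 times smaller.
X_rescaled = X.copy()
X_rescaled[:, 0] *= 1000.0

def max_prediction_change(make_model):
    """Largest change in fitted values when only the unit of feature 0 changes."""
    before = make_model().fit(X, y).predict(X)
    after = make_model().fit(X_rescaled, y).predict(X_rescaled)
    return np.abs(before - after).max()

ols_change = max_prediction_change(LinearRegression)
ridge_change = max_prediction_change(lambda: Ridge(alpha=10.0))

print("OLS change:  ", ols_change)
print("Ridge change:", ridge_change)
```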

In [None]:
# Ridge model on Robust-scaled data, without Ridge's own normalisation
# (`normalize` was removed in scikit-learn 1.2; leaving it out matches normalize=False)
from sklearn.linear_model import Ridge
ridge = Ridge(fit_intercept = True)
ridge.fit(x_train,y_train['target_carbon_monoxide'])
ridge_predict = ridge.predict(x_test)
# score() returns R^2 on the held-out split
print('\nRidge score: w/ Robust scaled data:', ridge.score(x_test,y_test['target_carbon_monoxide']))
print('Mean Absolute % Error: ', np.mean(np.abs(ridge_predict - y_test['target_carbon_monoxide'])/       
                     np.abs(y_test['target_carbon_monoxide'])))

In [None]:

# Splitting the data into train and test data for non-scaled data modelling
# (same random_state as before, so the rows match the scaled split)
from sklearn.model_selection import train_test_split
x_train1, x_test1, y_train1, y_test1 = train_test_split(x_data, y_data, test_size=0.2, random_state=10)
print("X-train1 shape", x_train1.shape, "-x--x--x-" ,  " Y-test1 shape" ,y_test1.shape)


# Ridge model on non-scaled data
# Note: fit_intercept is False here, while the scaled model uses fit_intercept = True
from sklearn.linear_model import Ridge
ridge = Ridge(fit_intercept = False)
ridge.fit(x_train1,y_train1['target_carbon_monoxide'])
ridge_predict = ridge.predict(x_test1)
print('\nRidge score: of non scaled data:', ridge.score(x_test1,y_test1['target_carbon_monoxide']))
print('Mean Absolute % Error: ', np.mean(np.abs(ridge_predict - y_test1['target_carbon_monoxide'])/       
                     np.abs(y_test1['target_carbon_monoxide'])))


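It seems the Robust-scaled model has a better Ridge score (R²) and a lower error %. Note that the non-scaled model was also fitted without an intercept, so part of the difference may come from that rather than from scaling alone. Now I'll continue with the other targets and check the model.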
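
The "Mean Absolute % Error" printed in these cells is the mean of |prediction - actual| / |actual|. A minimal sketch of it as a reusable helper follows; the function name `mape` and the numbers are made up for illustration, and the result is cross-checked against scikit-learn's `mean_absolute_percentage_error`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction (0.25 means 25%)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_pred - y_true) / np.abs(y_true))

# Made-up values: relative errors are 0.5, 0.25 and 0.
y_true = [2.0, 4.0, 10.0]
y_pred = [1.0, 5.0, 10.0]

ours = mape(y_true, y_pred)
reference = mean_absolute_percentage_error(y_true, y_pred)
print(ours, reference)
```

The metric is undefined when an actual value is exactly zero, so it is worth checking the targets for zeros before relying on it.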

## Target Benzene model Creation

In [None]:
# Ridge model on Robust-scaled data
# (`normalize` was removed in scikit-learn 1.2; leaving it out matches normalize=False)
from sklearn.linear_model import Ridge
ridge = Ridge(fit_intercept = True)
ridge.fit(x_train,y_train['target_benzene'])
ridge_predict = ridge.predict(x_test)
print('\nRidge score: of Robust scaled data: ', ridge.score(x_test,y_test['target_benzene']))
print('Mean Absolute % Error: ', np.mean(np.abs(ridge_predict - y_test['target_benzene'])/       
                     np.abs(y_test['target_benzene'])))


## Target Nitrogen Oxide - Model Creation

In [None]:
# Ridge model on Robust-scaled data, with intercept and without Ridge's own normalisation
# (`normalize` was removed in scikit-learn 1.2; leaving it out matches normalize=False)
from sklearn.linear_model import Ridge
ridge = Ridge(fit_intercept = True)
ridge.fit(x_train,y_train['target_nitrogen_oxides'])
ridge_predict = ridge.predict(x_test)
print('\nRidge score: of Robust scaled data: w/intercept, w/o normalisation', ridge.score(x_test,y_test['target_nitrogen_oxides']))
print('Mean Absolute % Error: ', np.mean(np.abs(ridge_predict - y_test['target_nitrogen_oxides'])/       
                     np.abs(y_test['target_nitrogen_oxides'])))


# Final regression model

* Parameters: `alpha = 1.0` (the default), `fit_intercept = True`, and no Ridge-level normalisation
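
The default `alpha` is kept here. If one wanted to check whether a different penalty strength helps, a possible approach is cross-validation with `RidgeCV`. The sketch below runs on synthetic data with an arbitrary candidate grid, so it is an assumption-laden illustration rather than part of the final model:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic stand-in for scaled features and one target.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=300)

# Candidate penalty strengths (arbitrary grid for illustration).
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]

# RidgeCV picks the alpha with the best cross-validated score.
search = RidgeCV(alphas=alphas).fit(X, y)
print("selected alpha:", search.alpha_)
```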

In [None]:
test.drop(columns = 'date_time', inplace = True)

In [None]:
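#  Robust scaling of the test data

from sklearn.preprocessing import RobustScaler

# Fit on the unscaled training split (x_train1 holds the same rows as x_train before scaling)
# and only transform the competition test set, so both are scaled the same way
scaler = RobustScaler().fit(x_train1)
test = pd.DataFrame(scaler.transform(test), columns = test.columns)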

In [None]:
# Final Ridge Model; 'target_carbon_monoxide', 'target_benzene', 'target_nitrogen_oxides'

from sklearn.linear_model import Ridge
ridge = Ridge(alpha = 1.0, fit_intercept = True)  # `normalize` was removed in scikit-learn 1.2
ridge.fit(x_train,y_train['target_carbon_monoxide'])
sample['target_carbon_monoxide'] = ridge.predict(test)

ridge.fit(x_train,y_train['target_benzene'])
sample['target_benzene'] = ridge.predict(test)

ridge.fit(x_train,y_train['target_nitrogen_oxides'])
sample['target_nitrogen_oxides'] = ridge.predict(test)

In [None]:
sample.to_csv('submission_Ridge regression.csv', index=False)
sample