## Multiple Linear Regression:

The aim is to predict the weight of a fish from its length, height, and width measurements.

In [1]:
#suppress warnings to keep the notebook output clean
import warnings
warnings.filterwarnings('ignore')

In [1]:
#import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import r2_score

In [1]:
#read the file
fish_data = pd.read_csv('../input/fish-market/Fish.csv')
fish_data.head()

In [1]:
#shape of the df:
fish_data.shape

In [1]:
#info of the df
fish_data.info()

In [1]:
fish_data.describe(percentiles = [0.05,0.10,0.25,0.50,0.75,0.90,0.99])

#### Let us understand the data given:

In [1]:
#let us check for null
fish_data.isnull().sum()

There are no null values in the data set.

1. `Species` column:

In [1]:
fish_data.Species.value_counts()

The data set contains 7 fish species.

In [1]:
sns.countplot(data = fish_data, x = 'Species')

In [1]:
sns.pairplot(data= fish_data, x_vars = ['Length1','Length2','Length3','Height','Width'], y_vars = 'Weight', hue = 'Species')

As the pair plot shows, the dependent variable `Weight` has a roughly linear relationship with each of the other variables.

In [1]:
#let us check the correlation
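#only numeric columns are used; recent pandas versions raise an error if the text column Species is included
sns.heatmap(fish_data.select_dtypes('number').corr(), annot = True)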

The variables are highly correlated with one another, which points to multicollinearity among the predictors.

2. `Weight` column:

In [1]:
#Variable Weight
sns.boxplot(x = fish_data['Weight'])

The box plot shows a few outliers. Let us find the rows with outlying values:

In [1]:
#checking outlier rows
fish_weight = fish_data['Weight']
Q3 = fish_weight.quantile(0.75)
Q1 = fish_weight.quantile(0.25)
IQR = Q3-Q1
lower_limit = Q1 -(1.5*IQR)
upper_limit = Q3 +(1.5*IQR)

In [1]:
weight_outliers = fish_weight[(fish_weight <lower_limit) | (fish_weight >upper_limit)]
weight_outliers

Three rows have outlying values.
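
The same IQR rule is repeated for several columns, so it could be wrapped in a small helper. The following is a minimal sketch, not part of the original notebook: the function name `iqr_outlier_mask` and the toy weights are assumptions made for illustration, and `np.percentile` is used because its default linear interpolation matches the default of pandas `quantile`.

```python
import numpy as np


def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask that is True where a value lies outside
    the interval [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])  # linear interpolation by default
    iqr = q3 - q1
    lower_limit = q1 - k * iqr
    upper_limit = q3 + k * iqr
    return (values < lower_limit) | (values > upper_limit)


# Made-up weights for illustration only (not taken from Fish.csv).
toy_weights = np.array([120.0, 200.0, 270.0, 300.0, 340.0, 390.0, 450.0, 1650.0])
mask = iqr_outlier_mask(toy_weights)
print(toy_weights[mask])
```

Since the mask is a plain boolean array, it should also work for indexing a pandas Series, for example `fish_weight[iqr_outlier_mask(fish_weight)]`.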

3. `Length1` column:

In [1]:
sns.boxplot(x = fish_data['Length1'])

In [1]:
#checking outlier rows
fish_Length1 = fish_data['Length1']
Q3 = fish_Length1.quantile(0.75)
Q1 = fish_Length1.quantile(0.25)
IQR = Q3-Q1
lower_limit = Q1 -(1.5*IQR)
upper_limit = Q3 +(1.5*IQR)
length1_outliers = fish_Length1[(fish_Length1 <lower_limit) | (fish_Length1 >upper_limit)]
length1_outliers

4. `Length2` column:

In [1]:
sns.boxplot(x = fish_data['Length2'])

In [1]:
#checking outlier rows
fish_Length2 = fish_data['Length2']
Q3 = fish_Length2.quantile(0.75)
Q1 = fish_Length2.quantile(0.25)
IQR = Q3-Q1
lower_limit = Q1 -(1.5*IQR)
upper_limit = Q3 +(1.5*IQR)
length2_outliers = fish_Length2[(fish_Length2 <lower_limit) | (fish_Length2 >upper_limit)]
length2_outliers

5. `Length3` column:

In [1]:
sns.boxplot(x = fish_data['Length3'])

In [1]:
#checking outlier rows
fish_Length3 = fish_data['Length3']
Q3 = fish_Length3.quantile(0.75)
Q1 = fish_Length3.quantile(0.25)
IQR = Q3-Q1
lower_limit = Q1 -(1.5*IQR)
upper_limit = Q3 +(1.5*IQR)
length3_outliers = fish_Length3[(fish_Length3 <lower_limit) | (fish_Length3 >upper_limit)]
length3_outliers

6. `Height` column:

In [1]:
sns.boxplot(x = fish_data['Height'])

7. `Width` column:

In [1]:
sns.boxplot(x = fish_data['Width'])

All the outliers in the data set lie in rows 142 to 144.

In [1]:
fish_data[142:145]

In [1]:
#let us drop these rows:
df = fish_data.drop([142,143,144])

In [1]:
# let us check our df after removal of outliers
df.describe(percentiles = [0.05,0.10,0.25,0.50,0.75,0.90,0.99])

#### Data preparation:

In [1]:
#creating dummies - to handle categorical variable.
#species_dummies = pd.get_dummies(df['Species'], prefix = 'Species' , drop_first = True)

In [1]:
# final_df = pd.concat([df,species_dummies], axis =1)
# final_df.head()

In [1]:
#dropping the original column as we have created dummies
#final_df = final_df.drop(['Species'], axis =1)

In [1]:
#final_df.shape

#### Let us split the data into train and test data:

In [1]:
df_train, df_test = train_test_split(df, train_size = 0.7, test_size = 0.3, random_state =100)

In [1]:
df_train.shape

In [1]:
df_test.shape

#### Rescaling the data:
The columns are on different scales, so let us standardize them:

In [1]:
scaler = StandardScaler()

In [1]:
scaling_columns = ['Weight', 'Length1','Length2','Length3','Height','Width']
df_train[scaling_columns] = scaler.fit_transform(df_train[scaling_columns])
df_train.describe()
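
The scaling step above can be sketched by hand to show what `fit_transform` and `transform` compute. This is a minimal illustration with made-up numbers, not the notebook's data: `StandardScaler` subtracts the training mean and divides by the training standard deviation computed with `ddof=0`, and the same two statistics are later reused for the test set.

```python
import numpy as np

# Made-up training and test values for one feature (illustration only).
train = np.array([10.0, 20.0, 30.0, 40.0])
test = np.array([25.0, 50.0])

# "fit": learn the mean and the population standard deviation (ddof=0)
# from the training data only.
mean = train.mean()
std = train.std(ddof=0)

# "transform": apply the same statistics to both sets.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(), train_scaled.std(ddof=0))
print(test_scaled)
```

Because pandas `describe()` reports the sample standard deviation (`ddof=1`), the scaled training columns show a standard deviation slightly above 1 rather than exactly 1.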

#### Model building:

In [1]:
y_train = df_train['Weight']
X_train = df_train.iloc[:,2:7]

In [1]:
X_train.head()

Let us first build the model with `statsmodels.api`:

#### Model 1:

In [1]:
X_train_sm = sm.add_constant(X_train)
model1 = sm.OLS(y_train,X_train_sm).fit()

In [1]:
print(model1.summary())

Some p-values are high, so some of the variables are not statistically significant. Let us calculate the VIF of each predictor.

In [1]:
VIF = pd.DataFrame()
VIF['Features'] = X_train.columns
VIF['vif'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
VIF['vif'] = round(VIF['vif'] ,2)
VIF = VIF.sort_values(by='vif',ascending = False)
VIF

Let us drop the `Length2` column, as it has both a high VIF and a high p-value.
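
To make the VIF values easier to interpret, the definition can be written out directly: the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 is obtained by regressing predictor j on all the other predictors. The sketch below is an illustration on synthetic data, with a hypothetical helper named `vif`. Unlike `variance_inflation_factor`, it adds an intercept to each auxiliary regression; on standardized, zero-mean columns such as the ones used here, the two approaches should agree.

```python
import numpy as np


def vif(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on all the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out


# Synthetic data (illustration only): b is almost a copy of a, c is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + rng.normal(scale=0.1, size=200)
c = rng.normal(size=200)
vif_values = vif(np.column_stack([a, b, c]))
print(vif_values.round(2))
```

The two nearly identical columns receive a large VIF while the unrelated column stays close to 1, which mirrors the pattern expected from the three length measurements.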

#### Model 2:

In [1]:
X2 = X_train.drop(['Length2'], axis =1)
X2_sm = sm.add_constant(X2)

model2 = sm.OLS(y_train,X2_sm).fit()

In [1]:
print(model2.summary())

In [1]:
#vif
VIF = pd.DataFrame()
VIF['Features'] = X2.columns
VIF['vif'] = [variance_inflation_factor(X2.values, i) for i in range(X2.shape[1])]
VIF['vif'] = round(VIF['vif'] ,2)
VIF = VIF.sort_values(by='vif',ascending = False)
VIF

Let us drop the `Length3` column, as it has a high VIF.

#### Model 3:

In [1]:
X3 = X2.drop(['Length3'], axis =1)
X3_sm = sm.add_constant(X3)

model3 = sm.OLS(y_train,X3_sm).fit()

In [1]:
print(model3.summary())

In [1]:
#vif
VIF = pd.DataFrame()
VIF['Features'] = X3.columns
VIF['vif'] = [variance_inflation_factor(X3.values, i) for i in range(X3.shape[1])]
VIF['vif'] = round(VIF['vif'] ,2)
VIF = VIF.sort_values(by='vif',ascending = False)
VIF

All the variables in Model 3 are significant, and the VIF values and p-values are within an acceptable range. The adjusted R-squared is 88.7%, so the model explains most of the variance without being too complex.

So our equation is:

#### Weight = 6.33e-18 + 0.4184 * Length1 + 0.2538 * Height + 0.3512 * Width

All variables in this equation are in standardized units, which is why the intercept is effectively zero.
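
Because the model was fitted on standardized columns, a prediction in original units requires standardizing the inputs and then undoing the scaling of `Weight`. The sketch below illustrates this under stated assumptions: the coefficients are the ones reported for Model 3, but the means and standard deviations are hypothetical placeholders. With the fitted scaler they would come from `scaler.mean_` and `scaler.scale_`.

```python
# Coefficients of Model 3 as reported above (standardized units).
INTERCEPT = 6.33e-18
COEF = {"Length1": 0.4184, "Height": 0.2538, "Width": 0.3512}

# Hypothetical training means and standard deviations, for illustration only.
# With the fitted scaler these would come from scaler.mean_ and scaler.scale_.
MEAN = {"Weight": 380.0, "Length1": 25.0, "Height": 9.0, "Width": 4.4}
STD = {"Weight": 320.0, "Length1": 9.0, "Height": 4.0, "Width": 1.6}


def predict_weight(length1, height, width):
    """Predict weight in original units from raw measurements."""
    raw = {"Length1": length1, "Height": height, "Width": width}
    # Standardize each input, apply the linear equation,
    # then undo the scaling of Weight.
    z = INTERCEPT + sum(COEF[k] * (raw[k] - MEAN[k]) / STD[k] for k in COEF)
    return z * STD["Weight"] + MEAN["Weight"]


print(predict_weight(25.0, 9.0, 4.4))
```

A fish with exactly average measurements is predicted to have the average weight, and each coefficient gives the change in weight, in standard deviations, per standard deviation of the corresponding predictor.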

#### Residual analysis:

In [1]:
y_train_pred = model3.predict(X3_sm)
y_train_pred.head()

In [1]:
residual = y_train - y_train_pred
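#distplot is deprecated in recent seaborn versions; histplot with a KDE overlay shows the same information
sns.histplot(residual, kde = True)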

The residuals look approximately normally distributed and are centered around zero.
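
A visual check can be complemented with two simple numbers: the skewness and the excess kurtosis of the residuals, both of which are close to 0 for normally distributed data. The following is a sketch on synthetic residuals, with a hypothetical helper name, and is not a result computed from the notebook's model.

```python
import numpy as np


def skew_and_excess_kurtosis(residuals):
    """Sample skewness and excess kurtosis; both are near 0 for normal data."""
    r = np.asarray(residuals, dtype=float)
    z = (r - r.mean()) / r.std(ddof=0)
    return np.mean(z ** 3), np.mean(z ** 4) - 3.0


# Synthetic residuals (illustration only).
rng = np.random.default_rng(42)
normal_like = rng.normal(size=5000)
right_skewed = rng.exponential(size=5000)

print(skew_and_excess_kurtosis(normal_like))
print(skew_and_excess_kurtosis(right_skewed))
```

In the notebook, the same function could be applied to `residual`; values far from 0 would suggest that the normality assumption deserves a closer look.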

In [1]:
#plotting y_train and y_train_pred
c = list(range(1, len(y_train) + 1))
plt.plot(c, y_train,color = 'Blue')
plt.plot(c, y_train_pred,color = 'red')
plt.title('Train(Blue) vs pred(Red)')

The graph shows that the model does not fit the peak values well.

#### Making Predictions:

In [1]:
# scale the test columns with the scaler that was fitted on the training set
df_test[scaling_columns] = scaler.transform(df_test[scaling_columns])
df_test.describe()

In [1]:
y_test = df_test['Weight']
X_test = df_test.iloc[:,2:7]

In [1]:
cols = X3.columns
cols

In [1]:
# keep only the columns that are part of Model 3
X_test = X_test[cols]
X_test.columns

In [1]:
#predicting
X_test_sm = sm.add_constant(X_test)
y_pred = model3.predict(X_test_sm)

In [1]:
y_pred.head()

#### Evaluating:

In [1]:
r_square = r2_score(y_test,y_pred)
r_square

The adjusted R-squared on the training set is 88.7% and the R-squared on the test set is 88.39%, so the model performs about as well on unseen data as on the training data.
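
The two figures compared here are related but not identical quantities. As a minimal sketch with made-up numbers, R-squared is 1 minus the ratio of the residual sum of squares to the total sum of squares, and adjusted R-squared additionally penalizes the number of predictors. The function names below are assumptions chosen for illustration.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def adjusted_r2(r_squared, n_samples, n_predictors):
    """Adjusted R-squared for a model with an intercept."""
    return 1.0 - (1.0 - r_squared) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Made-up values for illustration only.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.0])
score = r2(y_true, y_pred)
adj_score = adjusted_r2(score, n_samples=len(y_true), n_predictors=1)
print(score, adj_score)
```

For a like-for-like comparison in the notebook, the test score could be adjusted in the same way, for example `adjusted_r2(r_square, len(y_test), X_test.shape[1])`.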

In [1]:
#plotting y_test and y_pred
c = list(range(1, len(y_test) + 1))
plt.plot(c, y_test,color = 'Blue')
plt.plot(c, y_pred,color = 'red')
plt.title('Test(Blue) vs pred(Red)')

The graph above shows that the model does not predict the lowest and highest values well.