# Predicting Fish Weights with Multiple Linear Regression

The objective of this problem is to predict the weight of some fishes based on its species and dimensions.

This problem resolution is going to be organized in the following sections:

* **Section 1** - Data Analysis and Insights
* **Section 2** - Getting the training set and the test set
* **Section 3** - Getting the model
* **Section 4** - Conclusions about the model

### Section 1 - Data Analysis and Insights

In [1]:
#importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [1]:
#getting the data
df = pd.read_csv("../input/fish-market/Fish.csv")

In [1]:
#show the first five elements of the dataframe
df.head()

In [1]:
#segregation of numerical columns and categorical columns
cat_col = df.select_dtypes(include=['object']).columns
num_col = df.select_dtypes(exclude=['object']).columns
df_cat = df[cat_col]
df_num = df[num_col]

In [1]:
#Informations about the dataframe
df.info()

In [1]:
#analyzing if there is null values
df_null = df.isna().mean()
df_null.sort_values(ascending = False)

As can be noticed, there aren't null values in the dataset so it won't be necessary a preprocessing to take care about it.

In [1]:
df.dtypes

In [1]:
outliers = ['Weight']
plt.rcParams['figure.figsize'] = [8,8]
sns.boxplot(data=df[outliers], orient="v", palette="Set1" ,whis=1.5,saturation=1, width=0.7)
plt.title("Outliers Variable Distribution", fontsize = 14, fontweight = 'bold')
plt.ylabel("Profit Range", fontweight = 'bold')
plt.xlabel("Continuous Variable", fontweight = 'bold')
df.shape

There are 3 outliers. However, the amount of data is too low, so they won't be removed.

In [1]:
#checking for duplicates

df.loc[df.duplicated()]

Once again, it won't be necessary do any preprocessing because there aren't no duplicated row

In [1]:
#Visualizing the diferent fish species
plt.rcParams['figure.figsize'] = [5,5]
ax=df['Species'].value_counts().plot(kind='bar',stacked=True, colormap = 'Set1', color= "green")
ax.title.set_text('Number of fishes per species')
plt.xlabel("Names of the Species",fontweight = 'bold')
plt.ylabel("Count of Fishes",fontweight = 'bold')

Insight:
* The Perch specie has the largest amount of fishes
* The Whitefish specie has the least amount of fishes 

In [1]:
plt.figure(figsize=(8,8))

plt.title('Fishes\' Weight Plot')
sns.distplot(df['Weight'])

Insight:
* The distribution of weight is assimetric to the right

In [1]:
ax = sns.pairplot(df[num_col])

Insights:
* The dimensions of the fishes are linear correlated with its weight
* The dimensions also are linear correlated with each other

In [1]:
plt.figure(figsize = (8, 8))
sns.heatmap(df.corr(), cmap="RdYlGn")
plt.show()

Insight:
* The weight is correlated with all the fishes' dimensions except the height

In [1]:
plt.figure(figsize=(8, 8))
sns.boxplot(x = 'Species', y = 'Weight', data = df)
plt.show()

Insights:
* There are few outliers in the Roach and Smelt species

In [1]:
#Separating the dependent and independent values
y = df.iloc[:, 1].values
X = df.drop(columns=['Weight'])

In [1]:
#transforming the categorical value
ct = ColumnTransformer(transformers = [("enconder", OneHotEncoder(), [0])],
                       remainder = "passthrough")
X = np.array(ct.fit_transform(X))

In [1]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 0)

In [1]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)

In [1]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision = 2)
print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1))

In [1]:
r2_score(y_pred, y_test)

The R2 score is 0,921 which represents an accuracy of the linear regression of 92,1%. This is a very good model to predict future values of fishes' weight.

In [1]:
print("R2 value is " + str(r2_score(y_pred, y_test)))
print("The Mean Squared Error is " + str(mean_squared_error(y_test, y_pred)))

In [1]:
# getting the coeficients of the model
regressor.coef_

In [1]:
#getting the constant of the model
regressor.intercept_

In [1]:
#Visualizing once again the dataset
df.head()

Creating a label to show the equations:

$ x_{1} $ = Lenght 1

$ x_{2} $ = Lenght 2

$ x_{3} $ = Lenght 3

$ x_{4} $ = Height

$ x_{5} $ = Width

$ d_{1} $ = Boolean variable to represent if the specie is Bream

$ d_{2} $ = Boolean variable to represent if the specie is Parkki

$ d_{3} $ = Boolean variable to represent if the specie is Perch

$ d_{4} $ = Boolean variable to represent if the specie is Pike

$ d_{5} $ = Boolean variable to represent if the specie is Roach

$ d_{6} $ = Boolean variable to represent if the specie is Smelt

$ d_{7} $ = Boolean variable to represent if the specie is WhiteFish

$y$ = Weight


So, the equation is:

$y = -808.19  - 109.91d_{1} + 42.93d_{2} + 41.33d_{3} - 260.14d_{4} + 4.42d_{5} + 347.87d_{6} -66.51d_{7} - 89.11x_{1} + 87.41x_{2} + 31.63x_{3} + 7.76x_{4} + 2.36x_{5}$

We can notice that, based on the values of the coeficients, the Smelt specie has the most positive impact in the weight and the Pike Specie has the most negative impact.
Also, the Lenght 2 has the greater positive impact in the weight and the lenght 1 has the greater negative impact.