# Introduction
 
![header.png](https://raw.githubusercontent.com/satishgunjal/images/master/fish.png)

In this study, I use a Python 3 environment to build a machine learning model that predicts the weight of a fish from body measurement data covering seven fish species. You can download the dataset from Kaggle: [Fish market](https://www.kaggle.com/aungpyaeap/fish-market)
 
I am going to use a linear model from the sklearn library. Since there are multiple features, it is a **Multiple/Multi Variable Linear Regression** problem.
 
I have documented the code and tried to explain every important concept or library I have used during this study. I hope it will be helpful.



# Step 1: Import The Required Libraries
* numpy: NumPy is the core library for scientific computing in Python. It is used for working with arrays and matrices.
* pandas: Used for data manipulation and analysis
* matplotlib: A plotting library; we are going to use it for data visualization
* seaborn: Another data visualization library, based on matplotlib
* linear_model: Sklearn module that contains the linear regression model
* train_test_split: Helper function from the Sklearn library for splitting the dataset

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Step 2: Load The Data

In [None]:
df = pd.read_csv("/kaggle/input/fish-market/Fish.csv")
print('Shape of dataset= ', df.shape) # To get no of rows and columns
df.head(5) # head(n) returns first n records only. Can also use sample(n) for random n records.

 
# Step 3: Understand The Data
* There are a total of 159 rows (training samples) and 7 columns in the dataset.
* Details of each column are as below
 
| Column Name | Details |
| ----------- | ------- |
| Species     | Species name of fish |
| Weight      | Weight of fish in grams |
| Length1     | Vertical length in cm |
| Length2     | Diagonal length in cm |
| Length3     | Cross length in cm |
| Height      | Height in cm |
| Width       | Diagonal width in cm |
 
* Features/input values/independent variables are 'Species', 'Length1','Length2', 'Length3', 'Height' and 'Width'
* Target/output value/dependent variable is 'Weight'
* So, we have to estimate the weight of the fish based on its measurement values.
 
Let's rename the columns Length1, Length2 and Length3 to match their content.

In [None]:
df.rename(columns={'Length1':'VerticalLen','Length2':'DiagonalLen','Length3':'CrossLen'}, inplace=True) # inplace=True modifies the current dataframe instead of returning a copy
df.sample(5) # Display random 5 records

Let's print the detailed information about our dataset

In [None]:
df.info()

# Step 4: Data Analysis, Cleaning and Visualization

## Check for missing values

In [None]:
# isna() returns True if a value is None or numpy.NaN
# Empty strings '' and numpy.inf are not considered NA values by default
# (the option pandas.options.mode.use_inf_as_na changes this, but it is deprecated in recent pandas releases)
# You can also use df.isnull()
df.isna().sum() # Count of NaN values in each column
#df.isna().values.any()

Good, there are no null values in our dataset.
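To make the comments in the cell above concrete, here is a minimal sketch on a tiny made-up dataframe (the name `demo` and its values are assumptions for illustration, not the fish data). It shows which entries `isna()` counts as missing under default pandas settings.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame (not the fish data) mixing real gaps with look-alikes
demo = pd.DataFrame({
    "Species": ["Bream", None, "", "Pike"],
    "Weight": [242.0, np.nan, np.inf, 0.0],
})

missing_per_column = demo.isna().sum()   # count of NA values in each column
any_missing = demo.isna().values.any()   # True if at least one NA anywhere

print(missing_per_column)
print("Any missing value:", any_missing)
```

Only `None` and `NaN` are counted here; the empty string, `inf` and `0.0` are treated as ordinary values, which is why a weight of 0 has to be caught by a separate check.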

## Get count for each species

In [None]:
df.Species.value_counts()

The above function gives us the required values, but let's create a dataframe for the species counts so that we can use it for better visualization

In [None]:
df_sp = df.Species.value_counts()
df_sp = pd.DataFrame(df_sp)
df_sp.T 
# Note: Just like with matrices, 'dataframe.T' transposes index and columns
# I am using it just to save vertical space and make the notebook more readable

In [None]:
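plt.rcParams["figure.figsize"] = (10,6) # Set the figure size before plotting so it applies to this figure
# The counts column is named 'Species' in older pandas and 'count' in pandas 2.x, so select it by position
sns.barplot(x=df_sp.index, y=df_sp.iloc[:, 0]) # df_sp.index returns the row labels of the dataframe
plt.xlabel('Species')
plt.ylabel('Count of Species')
plt.title('Fish Count Based On Species')
plt.show()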

## Important Points
 
* As you can see, our dataset is very small. We have only 6 training examples for the 'Whitefish' species.
* The ideal approach would be to divide the dataset and do the prediction for each species. But since we don't have enough data, we will ignore the different species during our analysis.

## Using Domain Knowledge For Data Cleaning
* Using the known maximum and minimum weight of each species, we could very easily remove the outliers. But because of the limited data, we are going to ignore the individual species and treat them as one.
* Now let's use some common sense and find and remove the training data where the weight of the fish is 0 or negative

In [None]:
df[df.Weight <= 0]

Let's drop the training data at row 40. Note: Any time we make changes to the dataframe, we are going to increment the dataframe name by 1

In [None]:
df1 = df.drop([40])
print('New dimension of dataset is= ', df1.shape)
df1.head(5)
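The row label 40 is typed by hand in the cell above. As a hedged alternative, the same kind of cleanup can be driven by the condition itself, so it keeps working if the data changes. The dataframe `fish` below is made up for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Made-up rows for illustration; only the column names follow the notebook
fish = pd.DataFrame({
    "Species": ["Bream", "Roach", "Roach", "Pike"],
    "Weight": [242.0, 0.0, 120.0, 430.0],
})

bad_rows = fish.index[fish["Weight"] <= 0]   # index labels of impossible weights
fish_clean = fish.drop(bad_rows)             # same effect as fish[fish["Weight"] > 0]

print("Dropped index labels:", list(bad_rows))
print("New dimension of dataset is=", fish_clean.shape)
```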

## Correlation Check
* Correlation helps us investigate and establish relationships between variables
* Note that a high amount of correlation between independent variables suggests that the linear regression estimates will be unreliable

In [None]:
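df1.corr(numeric_only=True) # 'Species' is text, so restrict the calculation to numeric columns (needed in recent pandas versions)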

In [None]:
plt.rcParams["figure.figsize"] = (10,6) # Custom figure size in inches
sns.heatmap(df1.corr(numeric_only=True), annot=True) # numeric_only=True skips the text column 'Species'
plt.title('Correlation Matrix')

## Reading Correlation Matrix 
* Correlation coefficients range from -1 to +1
* The sign (+/-) indicates the direction and the magnitude indicates the strength of the correlation
* +1.00 means perfect positive relationship
* 0.00 means no relationship
* -1.00 means perfect negative relationship
* The correlation between 'VerticalLen', 'DiagonalLen' and 'CrossLen' is almost 1. This causes multicollinearity, and if we don't take care of it, it may lead to unreliable predictions.
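Beyond reading the correlation matrix, multicollinearity is often quantified with the variance inflation factor (VIF). The sketch below is not part of the original study: the helper `variance_inflation_factors` and the synthetic dataframe `demo` are assumptions made for illustration. It regresses each feature on the others and reports 1 / (1 - R^2).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def variance_inflation_factors(features: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing column j on the others."""
    vifs = {}
    for col in features.columns:
        others = features.drop(columns=col)
        r2 = LinearRegression().fit(others, features[col]).score(others, features[col])
        vifs[col] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF")

# Synthetic stand-in data: two nearly identical lengths plus an unrelated height
rng = np.random.default_rng(0)
vertical = rng.uniform(10, 40, size=100)
demo = pd.DataFrame({
    "VerticalLen": vertical,
    "DiagonalLen": vertical * 1.08 + rng.normal(0, 0.2, size=100),
    "Height": rng.uniform(2, 15, size=100),
})

vif = variance_inflation_factors(demo)
print(vif)
```

A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity; in this synthetic example the two length columns should land far above that range while the unrelated column stays close to 1.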
 
Let's drop the 'VerticalLen', 'DiagonalLen' and 'CrossLen' columns.

In [None]:
df2 = df1.drop(['VerticalLen', 'DiagonalLen', 'CrossLen'], axis =1) # Can also use axis = 'columns'
print('New dimension of dataset is= ', df2.shape)
df2.head()

## Visualization Using Pairplot

In [None]:
sns.pairplot(df2, kind = 'scatter', hue = 'Species')

From the above pair plot, we can see that there seems to be some correlation between Height, Width and Weight. Note that since we have multiple species, the relationship of Height and Width with Weight is not exactly linear across all species.
 
Now that we have the final dataset ready, let's analyze and remove the outliers, if any

## Outlier Detection and Removal
 
* An outlier is an extremely high or extremely low value in our data
* We use the formula below to identify outliers
  ```
    ( Greater than Q3 + 1.5 * IQR ) OR ( Lower than Q1 -1.5 * IQR )
 
    where,
    Q1  = First quartile
    Q3  = Third quartile
    IQR = Interquartile range (Q3 - Q1)
  ```
 
* Let's use a box plot for outlier visualization.
* In a horizontal box plot, the whisker on the left extends to the smallest value that is still within Q1 - 1.5 * IQR and the whisker on the right extends to the largest value that is still within Q3 + 1.5 * IQR. Any value outside this range is an outlier and is drawn as an individual point.

In [None]:
sns.boxplot(x=df2['Weight'])
plt.title('Outlier Detection based on Weight')

From the above plot it's clear that there are three outliers in the 'Weight' data. Let's create a function to find the index of these outliers.

In [None]:
def outlier_detection(dataframe):
  Q1 = dataframe.quantile(0.25)
  Q3 = dataframe.quantile(0.75)
  IQR = Q3 - Q1
  upper_end = Q3 + 1.5 * IQR
  lower_end = Q1 - 1.5 * IQR 
  outlier = dataframe[(dataframe > upper_end) | (dataframe < lower_end)]
  return outlier

In [None]:
outlier_detection(df2['Weight'])

So based on 'Weight' data, index 142, 143 and 144 are outliers

Let's check the 'Height' data

In [None]:
sns.boxplot(x =df2['Height'])
plt.title('Outlier Detection based on Height')

There is no outlier so no need to call 'outlier_detection()' function.

Let's check the 'Width' data

In [None]:
sns.boxplot(x = df2['Width'])
plt.title('Outlier Detection based on Width')

There is no outlier so no need to call 'outlier_detection()' function.

In [None]:
df3 = df2.drop([142,143,144])
df3.shape
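The index labels 142, 143 and 144 are typed by hand in the drop above. A minimal sketch of collecting them programmatically with the same 1.5 * IQR rule is shown below. The helper `iqr_outliers` mirrors the notebook's function under a different, hypothetical name, and the dataframe `demo` is made up for illustration.

```python
import pandas as pd

def iqr_outliers(series: pd.Series) -> pd.Series:
    """Return the values lying outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series[(series > q3 + 1.5 * iqr) | (series < q1 - 1.5 * iqr)]

# Made-up measurements; the last weight is deliberately extreme
demo = pd.DataFrame({
    "Weight": [100.0, 120.0, 130.0, 150.0, 160.0, 180.0, 200.0, 1650.0],
    "Height": [5.0, 5.5, 6.0, 6.2, 6.8, 7.0, 7.5, 8.0],
})

# Union of outlier index labels across the columns we care about
outlier_index = set()
for column in ["Weight", "Height"]:
    outlier_index.update(iqr_outliers(demo[column]).index)

demo_clean = demo.drop(index=sorted(outlier_index))
print("Outlier index labels:", sorted(outlier_index))
print("Shape after removal:", demo_clean.shape)
```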

In [None]:
df3.describe().T

# Step 5: Build Machine Learning Model

## Create Feature Matrix X and Target Variable y

In [None]:
#X = df3.iloc[:,[2,3]] # Select columns using column index
X = df3[['Height','Width']] # Select columns using column name
X.head()

In [None]:
#y = df3.iloc[:,[1]] # Select columns using column index
y = df3[['Weight']]
y.head(5)

## Create test and train dataset
* We will split the dataset, so that we can use one set of data for training the model and one set of data for testing the model
* We will keep 20% of data for testing and 80% of data for training the model

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# A fixed 'random_state' (42 here) keeps the split, and therefore the results, the same every time this code is executed
print('X_train dimension= ', X_train.shape)
print('X_test dimension= ', X_test.shape)
print('y_train dimension= ', y_train.shape)
print('y_test dimension= ', y_test.shape)
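The effect of `random_state` can be checked directly. The sketch below uses ten made-up samples (`X_demo`, `y_demo` are assumptions, not the fish data) and splits them twice with the same seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples, two features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Same seed twice: the two splits should be identical
_, X_test_a, _, y_test_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
_, X_test_b, _, y_test_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

print("Test rows (first run): ", y_test_a)
print("Test rows (second run):", y_test_b)
print("Identical splits:", np.array_equal(X_test_a, X_test_b))
```

With `test_size=0.2` and ten samples, two rows end up in the test set; leaving `random_state` unset would generally give a different pair on each run.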

## Ordinary Least Squares Algorithm

* Let's train the model using the Ordinary Least Squares algorithm
* This is one of the most basic linear regression algorithms.
* The mathematical formula used by the ordinary least squares algorithm is as below,

   ![ordinary_least_squares_formlua.png](https://github.com/satishgunjal/images/blob/master/ordinary_least_squares_formlua_1.png?raw=true)
* The objective of the Ordinary Least Squares algorithm is to minimize the residual sum of squares. Here the term residual means 'deviation of predicted value (Xw) from actual value (y)'
* Note that one problem with ordinary least squares is that the coefficients can become very large and unstable as model complexity grows, especially when the features are highly correlated
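To illustrate what "minimize the residual sum of squares" means in practice, the sketch below solves the least squares problem directly with `numpy.linalg.lstsq` and compares the result with `LinearRegression`. The synthetic data and the names `X_demo`, `y_demo`, `X_aug` and `sk_model` are assumptions for illustration; this is not a description of how sklearn is implemented internally.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with known coefficients: y = 3*x1 - 2*x2 + 5 + noise
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 5.0 + rng.normal(scale=0.1, size=50)

# Append a column of ones so the intercept is learned as an extra weight
X_aug = np.column_stack([X_demo, np.ones(len(X_demo))])
w, *_ = np.linalg.lstsq(X_aug, y_demo, rcond=None)   # minimises ||X_aug @ w - y||^2

sk_model = LinearRegression().fit(X_demo, y_demo)

print("lstsq coef   =", w[:2], "intercept =", w[2])
print("sklearn coef =", sk_model.coef_, "intercept =", sk_model.intercept_)
```

Both approaches should agree up to floating point error, and both should recover values close to the coefficients used to generate the data.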

In [None]:
model = linear_model.LinearRegression()
model.fit(X_train,y_train)

## Understanding Training Results
* If training is successful, we get a result like the one above, showing the LinearRegression() model and its parameters (depending on the sklearn version, only non-default values may be displayed). If required, we can pass different values to the LinearRegression() constructor. We are not going to change any of these values for now.
* As per our hypothesis function, 'model' object contains the coef(slope of line) and intercept values

In [None]:
print('coef= ', model.coef_) # Since we have two features(Height and Width), there will be 2 coef
print('intercept= ', model.intercept_)
print('score= ', model.score(X_test,y_test)) # score() returns the coefficient of determination (R^2) on the given data
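The value returned by `score()` is the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The sketch below recomputes it by hand on synthetic data (all names and numbers here are made up and stand in for the fish measurements) to show that the two agree.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for (Height, Width) -> Weight; not the fish data
rng = np.random.default_rng(1)
X_demo = rng.uniform(1, 10, size=(60, 2))
y_demo = 40.0 * X_demo[:, 0] + 25.0 * X_demo[:, 1] + rng.normal(scale=20.0, size=60)

demo_model = LinearRegression().fit(X_demo[:45], y_demo[:45])   # first 45 rows for training
X_held, y_held = X_demo[45:], y_demo[45:]                       # last 15 rows held out

pred = demo_model.predict(X_held)
ss_res = np.sum((y_held - pred) ** 2)            # residual sum of squares
ss_tot = np.sum((y_held - y_held.mean()) ** 2)   # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print("manual R^2 =", r2_manual)
print("score()    =", demo_model.score(X_held, y_held))
```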

## Predicting The Test Data
* Check below table for weight from test data and predicted weight by our model
* We will also plot the scatter plot of weight from test data vs predicted weight

In [None]:
predictedWeight = pd.DataFrame(model.predict(X_test), columns=['Predicted Weight']) # Create a new dataframe with the column 'Predicted Weight'
actualWeight = pd.DataFrame(y_test)
actualWeight = actualWeight.reset_index(drop=True) # Drop the index so that we can concat it, to create new dataframe
df_actual_vs_predicted = pd.concat([actualWeight,predictedWeight],axis =1)
df_actual_vs_predicted.T

As you can see from the above comparison, the predicted weights are negative when the actual weights are smaller than 20 g

We can also visualize the above comparison using scatter plots

In [None]:
plt.rcParams["figure.figsize"] = (10,6) # Custom figure size in inches, set before plotting so it applies to this figure
plt.scatter(y_test, model.predict(X_test))
plt.xlabel('Weight From Test Data')
plt.ylabel('Weight Predicted By Model')
plt.title("Weight From Test Data Vs Weight Predicted By Model")

In [None]:
plt.rcParams["figure.figsize"] = (10,6) # Custom figure size in inches, set before plotting so it applies to this figure
plt.scatter(X_test['Height'], y_test, color='red', label = 'Actual Weight')
plt.scatter(X_test['Height'], model.predict(X_test), color='green', label = 'Predicted Weight')
plt.xlabel('Height')
plt.ylabel('Weight')
plt.title('Actual Vs Predicted Weight for Test Data')
plt.legend()
plt.show()

In [None]:
plt.rcParams["figure.figsize"] = (10,6) # Custom figure size in inches, set before plotting so it applies to this figure
plt.scatter(X_test['Width'], y_test, color='red', label = 'Actual Weight')
plt.scatter(X_test['Width'], model.predict(X_test), color='green', label = 'Predicted Weight')
plt.xlabel('Width')
plt.ylabel('Weight')
plt.title('Actual Vs Predicted Weight for Test Data')
plt.legend()
plt.show()

# Step 6: Evaluating the Model

Plot a histogram of the residuals.

In [None]:
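plt.rcParams["figure.figsize"] = (10,6) # Custom figure size in inches
residuals = y_test['Weight'] - model.predict(X_test).ravel() # Actual minus predicted weight
sns.histplot(residuals, kde=True) # histplot replaces distplot, which is deprecated in recent seaborn releases
plt.title("Histogram of Residuals")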
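Alongside the histogram, the residuals can be summarised with error metrics expressed in the same unit as the target. The sketch below uses made-up actual and predicted weights (not results from this notebook) to compute the mean absolute error and the root mean squared error with `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up actual and predicted weights in grams, for illustration only
actual = np.array([120.0, 300.0, 450.0, 700.0, 950.0])
predicted = np.array([100.0, 320.0, 430.0, 760.0, 900.0])

residuals = actual - predicted
mae = mean_absolute_error(actual, predicted)            # average absolute miss
rmse = np.sqrt(mean_squared_error(actual, predicted))   # penalises large misses more

print("residuals =", residuals)
print("MAE  =", mae)
print("RMSE =", rmse)
```

On the real data, the same two calls could be applied to `y_test` and `model.predict(X_test)`.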

# Conclusion
* As you can see from the above results, our model score (R^2 on the test data) is 89.6%, which is good enough to start with.
* But one issue with the predictions is negative weight values. This happens for small (less than 20 g) weight values.
* In machine learning, every time we solve a problem we make some choices that affect the results.
* We have also made a few choices, like treating all species as one since we have a small dataset.
* I will try again with a different approach to try and eliminate the negative weight values.
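One possible direction for the negative predictions, offered only as a sketch and not as the approach this study will take, is to fit the model on log(weight) and exponentiate its predictions, which are then positive by construction. The data below is synthetic and every name in it is an assumption made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data in which weight grows roughly with the cube of size
rng = np.random.default_rng(7)
height = rng.uniform(2, 15, size=80)
width = 0.6 * height * np.exp(rng.normal(scale=0.1, size=80))
weight = 0.9 * height**2 * width * np.exp(rng.normal(scale=0.1, size=80))

X_demo = np.column_stack([height, width])

plain = LinearRegression().fit(X_demo, weight)           # fits weight directly
logged = LinearRegression().fit(X_demo, np.log(weight))  # fits log(weight) instead

pred_plain = plain.predict(X_demo)
pred_logged = np.exp(logged.predict(X_demo))             # exp() is positive for every input

print("min prediction, plain model:", pred_plain.min())
print("min prediction, log model:  ", pred_logged.min())
```

The plain model can dip below zero for the smallest fish, while the exponentiated predictions cannot. Whether this actually improves accuracy on the fish dataset would need to be checked with the same train/test split used above.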