# Avocado Dataset Analysis and ML Prediction

# Table of Contents

# Problem Statement

# Data Loading and Description

# * Importing packages

In [None]:
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # the 'warn' keyword is not accepted by recent Matplotlib releases
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas_profiling
%matplotlib inline

import plotly.offline as py
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
from plotly import tools

import warnings
warnings.filterwarnings("ignore")
warnings.filterwarnings("ignore",category=DeprecationWarning)

Read in the Avocado Prices csv file as a DataFrame called df

In [None]:
df= pd.read_csv("https://raw.githubusercontent.com/insaid2018/Term-2/master/Projects/avocado.csv")

# * Data Profiling

# * Understanding the Avocado Dataset

Let's check the shape of our data:

# Dataset has 18249 rows and 14 columns.

In [None]:
df.shape

In [None]:
(18249, 14)

In [None]:
df.columns  # This will print the names of all columns.

In [None]:
Index(['Unnamed: 0', 'Date', 'AveragePrice', 'Total Volume', '4046', '4225',
       '4770', 'Total Bags', 'Small Bags', 'Large Bags', 'XLarge Bags', 'type',
       'year', 'region'],
      dtype='object')

In [None]:
df.head()  # Will give you first 5 records

In [None]:
    Unnamed: 0  Date        AveragePrice  Total Volume  4046     4225       4770    Total Bags  Small Bags  Large Bags  XLarge Bags  type          year  region
    0           2015-12-27  1.33          64236.62      1036.74  54454.85   48.16   8696.87     8603.62     93.25       0.0          conventional  2015  Albany
    1           2015-12-20  1.35          54876.98      674.28   44638.81   58.33   9505.56     9408.07     97.49       0.0          conventional  2015  Albany
    2           2015-12-13  0.93          118220.22     794.70   109149.67  130.50  8145.35     8042.21     103.14      0.0          conventional  2015  Albany
    3           2015-12-06  1.08          78992.15      1132.00  71976.41   72.58   5811.16     5677.40     133.76      0.0          conventional  2015  Albany
    4           2015-11-29  1.28          51039.60      941.48   43838.39   75.78   6183.95     5986.26     197.69      0.0          conventional  2015  Albany

The feature "Unnamed: 0" simply repeats the row index, so there is no reason to keep it; we'll remove it in pre-processing.

In [None]:
df.tail()  # This will print the last 5 rows of the DataFrame

In [None]:
         Unnamed: 0   Date        AveragePrice    Total Volume  4046     4225     4770    Total Bags   Small Bags    Large Bags   XLarge Bags   type     year   region
18244    7            2018-02-04  1.63           17074.83       2046.96  1529.20  0.00    13498.67     13066.82     431.85        0.0           organic  2018   WestTexNewMexico
18245    8            2018-01-28  1.71           13888.04       1191.70  3431.50  0.00    9264.84      8940.04      324.80        0.0           organic  2018   WestTexNewMexico
18246    9            2018-01-21  1.87           13766.76       1191.92  2452.79  727.94  9394.11      9351.80      42.31         0.0           organic  2018   WestTexNewMexico
18247    10           2018-01-14  1.93           16205.22       1527.63  2981.04  727.01  10969.54     10919.54     50.00         0.0           organic  2018   WestTexNewMexico
18248    11           2018-01-07  1.62           17489.58       2894.77  2356.13  224.53  12014.15     11988.14     26.01         0.0           organic  2018   WestTexNewMexico

In [None]:
df.info() # This will give Index, Datatype and Memory information

In [None]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18249 entries, 0 to 18248
Data columns (total 14 columns):
Unnamed: 0      18249 non-null int64
Date            18249 non-null object
AveragePrice    18249 non-null float64
Total Volume    18249 non-null float64
4046            18249 non-null float64
4225            18249 non-null float64
4770            18249 non-null float64
Total Bags      18249 non-null float64
Small Bags      18249 non-null float64
Large Bags      18249 non-null float64
XLarge Bags     18249 non-null float64
type            18249 non-null object
year            18249 non-null int64
region          18249 non-null object
dtypes: float64(9), int64(2), object(3)
memory usage: 1.9+ MB

As a first observation, we are lucky: there are no missing values (all 18249 rows are complete) across the 14 columns. Later, in pre-processing, we'll do some feature engineering on the Date feature so we can use day and month columns when building our machine learning model. (I didn't mention the year because it's already in the data frame.)

In [None]:
# describe() reports summary statistics for the numeric columns;
# pass include='all' to cover the non-numeric columns as well.
# A count below the number of rows would indicate missing values.
df.describe()

In [None]:
        Unnamed: 0     AveragePrice   Total Volume   4046           4225           4770           Total Bags      Small Bags   Large Bags      XLarge Bags    year
count   18249.000000   18249.000000   1.824900e+04   1.824900e+04   1.824900e+04   1.824900e+04   1.824900e+04   1.824900e+04  1.824900e+04   18249.000000    18249.00000
mean    24.232232      1.405978       8.506440e+05   2.930084e+05   2.951546e+05   2.283974e+04   2.396392e+05   1.821947e+05  5.433809e+04   3106.426507     2016.147899
std     15.481045      0.402677       3.453545e+06   1.264989e+06   1.204120e+06   1.074641e+05   9.862424e+05   7.461785e+05  2.439660e+05   17692.894659    0.939938
min     0.000000       0.440000       8.456000e+01   0.000000e+00   0.000000e+00   0.000000e+00   0.000000e+00   0.000000e+00  0.000000e+00   0.000000        2015.000000
25%     10.000000      1.100000       1.083858e+04   8.540700e+02   3.008780e+03   0.000000e+00   5.088640e+03   2.849420e+03  1.274700e+02   0.000000        2015.000000
50%     24.000000      1.370000       1.073768e+05   8.645300e+03   2.906102e+04   1.849900e+02   3.974383e+04   2.636282e+04  2.647710e+03   0.000000        2016.000000
75%     38.000000      1.660000       4.329623e+05   1.110202e+05   1.502069e+05   6.243420e+03   1.107834e+05   8.333767e+04  2.202925e+04   132.500000      2017.000000
max     52.000000      3.250000       6.250565e+07   2.274362e+07   2.047057e+07   2.546439e+06   1.937313e+07   1.338459e+07  5.719097e+06   551693.650000   2018.000000

Every column has a count of 18249, so the dataset does not appear to contain missing values.

In [None]:
df.isnull().sum()  # Will show you null count for each column, but will not count Zeros(0) as null

In [None]:
Unnamed: 0      0
Date            0
AveragePrice    0
Total Volume    0
4046            0
4225            0
4770            0
Total Bags      0
Small Bags      0
Large Bags      0
XLarge Bags     0
type            0
year            0
region          0
dtype: int64


We can see that there are no missing values in the dataset, that's great!
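Since `isnull()` does not treat zeros as missing, it can be worth counting zeros separately. The following is a minimal sketch on a small made-up frame (`toy` and its values are illustrative, not taken from the dataset); in the notebook the same expressions would be applied to `df`.

```python
import pandas as pd

# Toy frame standing in for df (values are illustrative only)
toy = pd.DataFrame({
    "AveragePrice": [1.33, 1.35, 0.93],
    "XLarge Bags": [0.0, 0.0, 12.5],
    "4770": [48.16, 0.0, 130.50],
})

null_counts = toy.isnull().sum()   # missing values per column
zero_counts = (toy == 0).sum()     # zeros per column, which isnull() ignores

print(null_counts)
print(zero_counts)
```

A column with many zeros (as the quartiles of XLarge Bags in the describe() output suggest) is not "missing", but it may still deserve a closer look.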

# Preprocessing

In [None]:
profile = pandas_profiling.ProfileReport(df)
# Note: pandas-profiling 2.x and later name this argument output_file
profile.to_file(outputfile="avocado_before_preprocessing.html")

I ran Pandas Profiling before preprocessing the dataset to get initial observations in a more visual form, including a correlation matrix and sample data. The report was saved as the HTML file avocado_before_preprocessing.html.

Take a look at the file and see what useful insights you can develop from it.

# Warnings

# * Preprocessing

The feature "Unnamed: 0" simply repeats the row index, so there is no reason to keep it. Let's remove it now.

In [None]:
df.drop('Unnamed: 0',axis=1,inplace=True)

Let's check the head of our data again to make sure that the feature "Unnamed: 0" has been removed.

In [None]:
df.head()

In [None]:
     Date        AveragePrice  Total Volume   4046     4225       4770    Total Bags  Small Bags  Large Bags  XLarge Bags  type          year  region
0   2015-12-27   1.33          64236.62       1036.74  54454.85   48.16   8696.87     8603.62     93.25       0.0          conventional  2015  Albany
1   2015-12-20   1.35          54876.98       674.28   44638.81   58.33   9505.56     9408.07     97.49       0.0          conventional  2015  Albany
2   2015-12-13   0.93          118220.22      794.70   109149.67  130.50  8145.35     8042.21     103.14      0.0          conventional  2015  Albany
3   2015-12-06   1.08          78992.15       1132.00  71976.41   72.58   5811.16     5677.40     133.76      0.0          conventional  2015  Albany
4   2015-11-29   1.28          51039.60       941.48   43838.39   75.78   6183.95     5986.26     197.69      0.0          conventional  2015  Albany

Earlier, in the info() output, we saw that Date has the object type rather than a date type. We have to convert it to a datetime type.

In [None]:
df['Date'] = pd.to_datetime(df['Date'])
# The .dt accessor extracts date parts without a Python-level lambda
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day

Let's check the head to see what we have done:

In [None]:
df.head()

In [None]:
    Date         AveragePrice  Total Volume  4046     4225      4770   Total Bags  Small Bags  Large Bags   XLarge Bags   type          year  region   Month  Day
0   2015-12-27   1.33          64236.62      1036.74  54454.85  48.16   8696.87    8603.62     93.25        0.0           conventional  2015  Albany   12     27
1   2015-12-20   1.35          54876.98      674.28   44638.81  58.33   9505.56    9408.07     97.49        0.0           conventional  2015  Albany   12     20
2   2015-12-13   0.93          118220.22     794.70   109149.67 130.50  8145.35    8042.21     103.14       0.0           conventional  2015  Albany   12     13
3   2015-12-06   1.08          78992.15      1132.00  71976.41  72.58   5811.16    5677.40     133.76       0.0           conventional  2015  Albany   12     6
4   2015-11-29   1.28          51039.60      941.48   43838.39  75.78   6183.95    5986.26     197.69       0.0           conventional  2015  Albany   11     29

# Data Visualisation and Questions answered

**Organic vs Conventional**: The main difference between organic and conventional food products is the chemicals involved during production and processing. Interest in organic food products has been rising steadily in recent years, with new health super fruits emerging. Let's see if this is also the case with our dataset.

# Q.1 Which type of Avocados are more in demand (Conventional or Organic)?

In [None]:
Type=df.groupby('type')['Total Volume'].agg('sum')

values=[Type['conventional'],Type['organic']]
labels=['conventional','organic']

trace=go.Pie(labels=labels,values=values)
py.iplot([trace])

Organic avocados account for just over 2% of the total volume sold, so conventional avocados are clearly in higher demand. Now, let's look at the average price distribution.
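The share shown by the pie chart can also be computed as a number. This is a small sketch on made-up volumes (`toy` is illustrative); with the real data, `toy` would be replaced by `df`.

```python
import pandas as pd

# Toy volumes standing in for df (illustrative numbers only)
toy = pd.DataFrame({
    "type": ["conventional", "conventional", "organic", "organic"],
    "Total Volume": [600.0, 380.0, 12.0, 8.0],
})

# Total volume per type, then each type's share of the overall volume
volume_by_type = toy.groupby("type")["Total Volume"].sum()
share_pct = volume_by_type / volume_by_type.sum() * 100

print(share_pct.round(1))
```

Note that this measures the share of volume sold, not the share of rows in the dataset.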

# Q.2 In which range does the Average price lie, and what does its distribution look like?

In [None]:
from scipy.stats import norm

sns.set(font_scale=1.5)
fig, ax = plt.subplots(figsize=(15, 9))
# Histogram of AveragePrice with a fitted normal curve overlaid
sns.distplot(a=df['AveragePrice'], kde=False, fit=norm)
plt.show()


The average price distribution shows that in most cases the price of an avocado lies between 1.1 and 1.4.

Let's look at average price of conventional vs. organic.

# Q.3 How is the Average price distributed over the months for Conventional and Organic types?

In [None]:
plt.figure(figsize=(18, 10))
sns.lineplot(x="Month", y="AveragePrice", hue='type', data=df)
plt.title('Average Price Over Months by Type')
plt.xlabel('Month')
plt.ylabel('Average Price')
plt.show()

It looks like prices rose between months 8 and 10 for both conventional and organic avocados.

# Now let's plot the Average price distribution by region

# Q.4 What are the TOP 5 regions where the Average price is highest?

In [None]:
# Mean AveragePrice per region, sorted from highest to lowest
region_price = df.groupby('region')['AveragePrice'].mean().sort_values(ascending=False)

plt.figure(figsize=(24, 10))
ax = sns.barplot(x=region_price.index, y=region_price.values)

plt.xticks(rotation=90)
plt.xlabel('Region')
plt.ylabel('Average Price')
plt.title('Average Price of Avocado According to Region')
plt.show()

# Q.5 What are the TOP 5 regions where Average consumption is highest?

In [None]:
# Exclude the 'TotalUS' aggregate so it does not dwarf the individual regions
regional = df[df.region != 'TotalUS']

# Mean Total Volume per region, sorted from highest to lowest
region_volume = regional.groupby('region')['Total Volume'].mean().sort_values(ascending=False)

plt.figure(figsize=(22, 10))
ax = sns.barplot(x=region_volume.index, y=region_volume.values)

plt.xticks(rotation=90)
plt.xlabel('Region')
plt.ylabel('Average of Total Volume')
plt.title('Average of Total Volume According to Region')
plt.show()
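The bar charts show the full ranking, but Q.4 and Q.5 ask specifically for the top 5. One possible way to read those off directly is sketched below; `top_regions` is a hypothetical helper (not part of the original notebook) and `toy` is made-up data standing in for `df`.

```python
import pandas as pd

def top_regions(frame, column, n=5, exclude=("TotalUS",)):
    """Return the n regions with the highest mean of `column`.

    Hypothetical helper for illustration; regions listed in `exclude`
    (by default the 'TotalUS' aggregate) are dropped first.
    """
    kept = frame[~frame["region"].isin(exclude)]
    return kept.groupby("region")[column].mean().nlargest(n)

# Toy data standing in for df (illustrative values only)
toy = pd.DataFrame({
    "region": ["Albany", "Albany", "TotalUS", "Houston", "Houston", "Boise"],
    "AveragePrice": [1.5, 1.7, 1.2, 1.0, 1.1, 1.3],
    "Total Volume": [100.0, 120.0, 9000.0, 500.0, 700.0, 80.0],
})

top_volume = top_regions(toy, "Total Volume", n=2)
top_price = top_regions(toy, "AveragePrice", n=2)
print(top_volume)
print(top_price)
```

With the real data, `top_regions(df, 'AveragePrice')` and `top_regions(df, 'Total Volume')` would list the five regions for each question; passing `exclude=()` keeps the 'TotalUS' rows, as the original Q.4 cell did.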

# Q.6 In which year and for which region was the Average price the highest?

In [None]:
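# factorplot was renamed to catplot (and 'size' to 'height') in seaborn 0.9;
# kind='point' with join=False keeps the original unjoined point plot
g = sns.catplot(x='AveragePrice', y='region', data=df,
                hue='year',
                height=18,
                aspect=0.7,
                palette='Blues',
                kind='point',
                join=False,
                )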

It looks like there was a sharp increase in avocado prices in the SanFrancisco region in 2017, as demand was somewhat higher. If you search for it on Google, you'll find the same.

# Q.7 How is the price distributed over the Date column?

Now let's do some plots! I'll start by plotting the avocado's Average Price against the Date column.

In [None]:
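# Select the column before averaging so non-numeric columns are not aggregated
byDate = df.groupby('Date')['AveragePrice'].mean()
plt.figure(figsize=(12, 8))
byDate.plot()
plt.title('Average Price Over Time')
plt.xlabel('Date')
plt.ylabel('Average Price')
plt.show()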

This also shows a sharp rise in prices after July 2017 and before January 2018, which confirms what we saw in the earlier graph.

Cool, right? Now let's get an idea of the relationships between our features (correlation).

# Q.8 How are the dataset features correlated with each other?

In [None]:
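plt.figure(figsize=(12, 6))
# Restrict to numeric columns; correlation is not defined for strings or dates
sns.heatmap(df.select_dtypes(include='number').corr(), cmap='coolwarm', annot=True)
plt.show()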

As we can see from the heatmap above, none of the features is strongly correlated with the AveragePrice column; instead, most of them are correlated with each other. That worries me a bit, because it will not help us get a good model. Let's try and see.
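A heatmap can be hard to read for a single target, so it may help to rank the features by the strength of their correlation with AveragePrice. The sketch below uses a small made-up frame (`toy` is illustrative); in the notebook, `toy` would be replaced by `df`.

```python
import pandas as pd

# Toy numeric frame standing in for df (illustrative values only)
toy = pd.DataFrame({
    "AveragePrice": [1.0, 1.2, 1.4, 1.6, 1.8],
    "Total Volume": [500.0, 400.0, 300.0, 200.0, 100.0],
    "Month": [1, 5, 2, 4, 3],
})

corr = toy.select_dtypes(include="number").corr()

# Correlation of each feature with the target, ordered by absolute strength
target_corr = corr["AveragePrice"].drop("AveragePrice")
ranked = target_corr.loc[target_corr.abs().sort_values(ascending=False).index]

print(ranked)
```

Sorting by absolute value keeps strong negative correlations at the top, where a plain descending sort would push them to the bottom.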



First we have to do some feature engineering on the categorical features: region and type.

# * Feature Engineering for Model building

In [None]:
df['region'].nunique()

In [None]:
54

In [None]:
df['type'].nunique()

In [None]:
2

As we can see, we have 54 regions and 2 unique types, so it's going to be easy to transform the type feature into dummies. For region it would be more complex, so I decided to drop the entire column.

I will drop the Date Feature as well because I already have 3 other columns for the Year, Month and Day.

In [None]:
df_final=pd.get_dummies(df.drop(['region','Date'],axis=1),drop_first=True)
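To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up frame (`toy` is illustrative, not the real data). With two categories, one indicator column is enough, so the first category (alphabetically, `conventional`) is dropped.

```python
import pandas as pd

# Toy frame standing in for df (illustrative values only)
toy = pd.DataFrame({
    "AveragePrice": [1.33, 1.63, 1.28],
    "type": ["conventional", "organic", "conventional"],
})

# Only the object column is encoded; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())
print(encoded["type_organic"].astype(int).tolist())
```

Depending on the pandas version, the indicator column may be stored as booleans rather than 0/1 integers; the `astype(int)` call above is only there to print it uniformly.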

In [None]:
df_final.head()

In [None]:
   AveragePrice  Total Volume  4046     4225       4770    Total Bags  Small Bags  Large Bags  XLarge Bags  year  Month  Day  type_organic
0  1.33          64236.62      1036.74  54454.85   48.16   8696.87     8603.62     93.25       0.0          2015  12     27   0
1  1.35          54876.98      674.28   44638.81   58.33   9505.56     9408.07     97.49       0.0          2015  12     20   0
2  0.93          118220.22     794.70   109149.67  130.50  8145.35     8042.21     103.14      0.0          2015  12     13   0
3  1.08          78992.15      1132.00  71976.41   72.58   5811.16     5677.40     133.76      0.0          2015  12     6    0
4  1.28          51039.60      941.48   43838.39   75.78   6183.95     5986.26     197.69      0.0          2015  11     29   0

In [None]:
df_final.tail()

In [None]:
        AveragePrice  Total Volume  4046     4225     4770    Total Bags  Small Bags   Large Bags   XLarge Bags  year   Month  Day  type_organic
18244   1.63          17074.83      2046.96  1529.20  0.00    13498.67    13066.82     431.85       0.0          2018   2      4    1
18245   1.71          13888.04      1191.70  3431.50  0.00    9264.84     8940.04      324.80       0.0          2018   1      28   1
18246   1.87          13766.76      1191.92  2452.79  727.94  9394.11     9351.80      42.31        0.0          2018   1      21   1
18247   1.93          16205.22      1527.63  2981.04  727.01  10969.54    10919.54     50.00        0.0          2018   1      14   1
18248   1.62          17489.58      2894.77  2356.13  224.53  12014.15    11988.14     26.01        0.0          2018   1      7    1

# * Model selection/predictions

Now our data is ready! Let's apply our first model, Linear Regression, because our target variable 'AveragePrice' is continuous.

Let's now begin to train our regression model! We first need to split our data into an X array that contains the features to train on, and a y array with the target variable.

# P.1 Are we good with Linear Regression? Let's find out

In [None]:
from sklearn.model_selection import train_test_split

# Features are every column except the target
X = df_final.drop('AveragePrice', axis=1)
y = df_final['AveragePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Creating and Training the Model

In [None]:
from sklearn.linear_model import LinearRegression
lr=LinearRegression()
lr.fit(X_train,y_train)
pred=lr.predict(X_test)

In [None]:
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, pred))
print('MSE:', metrics.mean_squared_error(y_test, pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, pred)))

The RMSE looks low relative to the spread of prices in the data (AveragePrice ranges from 0.44 to 3.25), which suggests a reasonable model, but let's check to be more sure.
Let's plot y_test against the predictions.
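One way to judge whether an RMSE is actually "low" is to compare it with a naive baseline that always predicts the mean price. The sketch below uses made-up numbers (`y_true` and `y_pred` are illustrative; in the notebook they would be `y_test` and `pred`), and `rmse` is a small helper defined here for the example.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Illustrative numbers only, standing in for y_test and pred
y_true = np.array([1.0, 1.2, 1.4, 1.6, 1.8])
y_pred = np.array([1.1, 1.1, 1.5, 1.5, 1.8])

model_rmse = rmse(y_true, y_pred)
# Baseline: predict the mean of the target for every row
baseline_rmse = rmse(y_true, np.full_like(y_true, y_true.mean()))

print("model RMSE:", round(model_rmse, 3))
print("baseline RMSE:", round(baseline_rmse, 3))
```

A model is only informative if its RMSE is clearly below the baseline. For a stricter comparison, the baseline mean would be taken from the training target rather than the test target.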

In [None]:
plt.scatter(x=y_test,y=pred)

As we can see, the points do not form a straight line, so I am not sure that this is the best model we can apply to our data.

Let's try the Decision Tree Regression model.

# P.2 Are we good with Decision Tree Regression? Let's find out.

In [None]:
from sklearn.tree import DecisionTreeRegressor
dtr=DecisionTreeRegressor()
dtr.fit(X_train,y_train)
pred=dtr.predict(X_test)

In [None]:
plt.scatter(x=y_test,y=pred)
plt.xlabel('Y Test')
plt.ylabel('Predicted Y')

Nice, here we can see that we nearly have a straight line; in other words, it's better than the Linear Regression model. To be more sure, let's check the RMSE.

In [None]:
print('MAE:', metrics.mean_absolute_error(y_test, pred))
print('MSE:', metrics.mean_squared_error(y_test, pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, pred)))

In [None]:
MAE: 0.13404109589041097
MSE: 0.04273295890410959
RMSE: 0.2067195174726121

Very nice, our RMSE is lower than the one we got with Linear Regression. Now I am going to try one last model, the RandomForestRegressor, to see if I can improve my predictions for this data.

# P.3 Are we good with Random Forest Regressor? Let's find out.

In [None]:
from sklearn.ensemble import RandomForestRegressor
rdr = RandomForestRegressor()
rdr.fit(X_train,y_train)
pred=rdr.predict(X_test)

In [None]:
print('MAE:', metrics.mean_absolute_error(y_test, pred))
print('MSE:', metrics.mean_squared_error(y_test, pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, pred)))

As we can see, the RMSE is lower than for the two previous models, so the Random Forest Regressor is the best model in this case.
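The three models were fitted one cell at a time, reusing the name `pred` each time. A compact way to compare them side by side is sketched below. It runs on synthetic data from `make_regression` (an assumption made so the example is self-contained); in the notebook, `X_tr`, `X_te`, `y_tr` and `y_te` would be the existing `X_train`, `X_test`, `y_train` and `y_test`. The `random_state` and `n_estimators` values are illustrative choices, not settings from the original.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the avocado features and target
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

models = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=42),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Fit each model on the same split and record its test RMSE
rmse_by_model = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    mse = mean_squared_error(y_te, model.predict(X_te))
    rmse_by_model[name] = float(np.sqrt(mse))

# Print the models from best (lowest RMSE) to worst
for name, value in sorted(rmse_by_model.items(), key=lambda item: item[1]):
    print(f"{name}: RMSE = {value:.3f}")
```

The ranking on synthetic data says nothing about the avocado results; the point is only that evaluating every model on the same split with fixed seeds makes the comparison reproducible.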

In [None]:
sns.distplot((y_test-pred),bins=50)

Notice that the residuals look approximately normally distributed, which is a good sign that the model is a reasonable choice for the data.

# Let's see a final Actual vs Predicted sample.

In [None]:
data = pd.DataFrame({'Y Test':y_test , 'Pred':pred},columns=['Y Test','Pred'])
sns.lmplot(x='Y Test',y='Pred',data=data,palette='rainbow')
data.head()

In [None]:
      Y Test   Pred
8604   0.82    0.993
2608   0.97    0.998
14581  1.44    1.344
4254   0.97    0.894
16588  1.45    1.451

# * Conclusions

With the help of this notebook I learnt how EDA can be carried out using Pandas and other plotting libraries.

I have also seen how packages like matplotlib, plotly and seaborn can be used to develop better insights about the data.


I have also seen how preprocessing helps in dealing with missing values and irregularities present in the data. I also learnt how to create new features, which in turn help us to better predict the average price.

I also made use of the pandas profiling feature to generate an HTML report containing information about the various features present in the dataset.

I have seen the impact of columns like type and year/date on the rate at which the Average price increases or decreases.

The most important inference drawn from all this analysis is that I now know which features the price is most positively and negatively correlated with.

Through the analysis I found out which model works with better accuracy, judged by low residuals and RMSE scores.

This project helped me to gain insights into how I should go with the flow: which model to choose first, and how to go step by step to attain results with good accuracy. I also got to know where to use Linear, Decision Tree and other applicable models to fine-tune the predictions.