# Linear Regression on Avocado Prices

### Description
I love avocados, but they are expensive. This is a fun project that uses some ML principles to predict where avocado prices are heading, and whether this delicacy will soon outprice my wallet and salary.

### Key Data Features
<ul>
    <li>Date - the date of the observation</li>
    <li>AveragePrice - the average price of a single avocado</li>
    <li>type - conventional or organic</li>
    <li>year - the year of the observation</li>
    <li>region - the city or region of the observation</li>
    <li>Total Volume - total number of avocados sold</li>
    <li>4046 - total number of avocados with PLU 4046 sold</li>
    <li>4225 - total number of avocados with PLU 4225 sold</li>
    <li>4770 - total number of avocados with PLU 4770 sold</li>
</ul>


In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
path = '/kaggle/input/avocado.csv'
# Use the default NA handling so that empty fields become NaN and show up in the missing-value check
df = pd.read_csv(path)

In [None]:
df.sort_values(by=['Date'], inplace=True)

In [None]:
df.head(5)

In [None]:
# Check for missing values: any() gives one boolean per column, and describe() summarizes them.
# If no column has missing values, the summary shows a single unique value, False.
df.isnull().any().describe()
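
One caveat with this kind of check: `isnull()` only flags cells that pandas parsed as NaN. The sketch below uses a made-up two-row CSV (not the avocado data) to show how the `keep_default_na` option of `pd.read_csv` changes what counts as missing.

```python
import io
import pandas as pd

# Hypothetical CSV with one empty AveragePrice field.
csv_text = "Date,AveragePrice\n2015-01-04,1.10\n2015-01-11,\n"

# Default parsing: the empty field becomes NaN.
default_df = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False: the empty field stays an empty string.
raw_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_df.isnull().sum())  # per-column count of missing cells
print(raw_df.isnull().sum())
```

A per-column `isnull().sum()` is also a little easier to read than `any().describe()` when only a few columns are affected.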

In [None]:
def get_weighted_average(arr1, arr2):
    """Return the average of arr1 weighted by arr2."""
    s1 = np.dot(arr1, arr2)
    s2 = np.sum(arr2)
    return s1 / s2
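
To make the weighting concrete, here is a small worked example with made-up numbers (two regions on one date). It also checks the result against `np.average`, which accepts a `weights` argument and should give the same value.

```python
import numpy as np

def get_weighted_average(values, weights):
    """Return the average of `values`, weighted by `weights`."""
    return np.dot(values, weights) / np.sum(weights)

# Hypothetical toy data: two regions on the same date.
prices = np.array([1.00, 2.00])      # average price per avocado
volumes = np.array([300.0, 100.0])   # avocados sold

# (1.00 * 300 + 2.00 * 100) / (300 + 100) = 1.25
weighted = get_weighted_average(prices, volumes)
builtin = np.average(prices, weights=volumes)
print(weighted, builtin)
```

The high-volume region pulls the result toward its price, which is why the weighted value sits below the plain mean of 1.50.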

In [None]:
# df is already sorted by Date, so unique() returns the dates in chronological order
dates = df.Date.unique()

# For each date, collapse all rows into one volume-weighted price and one total volume
arr = []
for date in dates:
    temp = df[df.Date == date]
    avgPrices = temp['AveragePrice']
    totalVolume = temp['Total Volume']
    weightedAvg = get_weighted_average(avgPrices, totalVolume)
    totalVolumeDay = totalVolume.sum()
    arr.append([date, weightedAvg, totalVolumeDay])

In [None]:
headers = ['date', 'weightedAvgPrice', 'totalVolume']
df1 = pd.DataFrame(data=arr, columns=headers)

In [None]:
df1.head(5)
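
The per-date loop above could probably also be written with `groupby`. The sketch below runs on a tiny made-up frame with the same three columns (`toy`, `revenue` and `toy_df1` are names I invented for the illustration), so treat it as an alternative to try rather than what this notebook runs.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the avocado data: two dates, two regions each.
toy = pd.DataFrame({
    'Date': ['2015-01-04', '2015-01-04', '2015-01-11', '2015-01-11'],
    'AveragePrice': [1.00, 2.00, 1.50, 1.10],
    'Total Volume': [300.0, 100.0, 200.0, 200.0],
})

# Revenue per row, so that sum(revenue) / sum(volume) is the weighted price.
toy['revenue'] = toy['AveragePrice'] * toy['Total Volume']
grouped = toy.groupby('Date', sort=True)[['revenue', 'Total Volume']].sum()

toy_df1 = pd.DataFrame({
    'date': grouped.index.to_numpy(),
    'weightedAvgPrice': (grouped['revenue'] / grouped['Total Volume']).to_numpy(),
    'totalVolume': grouped['Total Volume'].to_numpy(),
})
print(toy_df1)
```

This avoids filtering the full frame once per date, which should matter more as the number of dates grows.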

### Linear Regression on Total Volume to Weighted Average Prices

Are volume sold and price correlated, such that higher volume coincides with lower prices (and vice versa)?

In [None]:
X = np.array(df1['totalVolume']).reshape(-1, 1)
y = df1['weightedAvgPrice']

In [None]:
clf = LinearRegression(fit_intercept=True)
clf.fit(X, y)

In [None]:
y_pred = clf.predict(X)

In [None]:
# In-sample error: the model is scored on the same data it was fit on
score = mean_squared_error(y, y_pred)
score
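
The MSE alone does not say whether volume and price move in opposite directions. A rough way to check is to look at the sign of the fitted slope and at R². The sketch below uses synthetic data as a stand-in for `df1` (the numbers are invented, and `X_demo`, `model` and `pearson` are illustrative names), so only the method carries over, not the values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in for df1: price falls as volume rises, plus noise.
rng = np.random.default_rng(0)
volume = rng.uniform(5e7, 1e8, size=100)
price = 2.0 - 1e-8 * volume + rng.normal(0.0, 0.05, size=100)

X_demo = volume.reshape(-1, 1)
model = LinearRegression().fit(X_demo, price)
pred = model.predict(X_demo)

slope = model.coef_[0]                      # price change per extra avocado sold
r2 = r2_score(price, pred)                  # share of price variance explained
pearson = np.corrcoef(volume, price)[0, 1]  # linear correlation coefficient
print(slope, r2, pearson)
```

For a single-feature linear fit with an intercept, R² equals the squared Pearson correlation, so either number answers the question; the slope's sign gives the direction.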

In [None]:
plt.title('Avocado Price vs. Total Volume')
plt.xlabel('Volume')
plt.ylabel('Prices')
plt.scatter(X, y, color='black')
plt.plot(X, y_pred, color='blue')
plt.xticks(())
plt.yticks(())
plt.show()

In [None]:
df2 = df1.copy()
df2.drop(columns=['date'], inplace=True)
# Use the row position as a simple time index (rows are in chronological order)
df2['idx'] = np.arange(len(df2))

In [None]:
df2.head(5)

### Linear Regression on Time to Weighted Average Prices

Are avocado prices rising over time?

In [None]:
X1 = np.array(df2['idx']).reshape(-1, 1)
y1 = df2['weightedAvgPrice']
# Fix the seed so the random split, and therefore the score, is reproducible
X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size=0.25, random_state=0)

In [None]:
clf1 = LinearRegression(fit_intercept=True)
clf1.fit(X_train, y_train)

In [None]:
y_pred = clf1.predict(X_test)

In [None]:
score = mean_squared_error(y_test, y_pred)
score
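
Because the goal is to say something about future prices, a random split may flatter the model: test points sit between training points in time. An alternative, which is not what this notebook does, is a chronological hold-out where the model only sees the earliest 75% of observations. The sketch below uses synthetic data and invented names (`cut`, `X_tr`, `X_te`, `mse_future`).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for df2: a time index and a gently rising price.
rng = np.random.default_rng(0)
idx = np.arange(160).reshape(-1, 1)
price = 1.0 + 0.002 * idx.ravel() + rng.normal(0.0, 0.05, size=160)

# Chronological split: first 75% for training, last 25% for testing.
cut = int(len(idx) * 0.75)
X_tr, X_te = idx[:cut], idx[cut:]
y_tr, y_te = price[:cut], price[cut:]

model = LinearRegression().fit(X_tr, y_tr)
mse_future = mean_squared_error(y_te, model.predict(X_te))
print(mse_future)
```

Passing `shuffle=False` to `train_test_split` should produce the same kind of split without manual slicing.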

In [None]:
plt.title('Avocado Trending Prices')
plt.xlabel('Time')
plt.ylabel('Prices')
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, y_pred, color='blue')
plt.xticks(())
plt.yticks(())
plt.show()
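
To turn the fitted trend into an actual forecast, the model can be asked for index values beyond the last observation. The sketch below does this on synthetic data (the 52-step horizon and all names are assumptions for the illustration). A straight-line extrapolation ignores seasonality, so it is only a rough guide.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for df2: a time index and a gently rising price.
rng = np.random.default_rng(1)
idx = np.arange(160).reshape(-1, 1)
price = 1.0 + 0.002 * idx.ravel() + rng.normal(0.0, 0.05, size=160)

model = LinearRegression().fit(idx, price)

# Extrapolate 52 observations past the end of the data.
future_idx = np.arange(len(idx), len(idx) + 52).reshape(-1, 1)
forecast = model.predict(future_idx)

print(model.coef_[0])   # estimated price change per observation
print(forecast[-1])     # predicted price at the end of the horizon
```

A positive `coef_[0]` would mean prices are trending upward, which is the question this section set out to answer.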