In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
from statsmodels.stats import diagnostic as diag
from statsmodels.stats.outliers_influence import variance_inflation_factor

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from scipy import stats
from scipy.stats import kurtosis, skew

%matplotlib inline

In [None]:
df = pd.read_csv("/kaggle/input/used-car-dataset-ford-and-mercedes/audi.csv")

In [None]:
df.head(5)

### Defining the Features

#### Mileage
- Mileage is the total distance a car has travelled since its manufacture. A typical car covers about 1,000 miles (1,600 km) per month, which gives an expected average of 12,000 miles (19,200 km) per year. The odometer shows the car's mileage.

- High mileage accumulated on federal/interstate roads tends to cause less wear and tear than the same mileage accumulated in the city.

- Extremely low mileage that is not consistent with the age of the car could indicate damage or a long period of inactivity.

- Mileage above 150,000 is bad for a used car, and such cars should be avoided when buying.

- Factors that cause poor gas mileage include the age of the car, its weight and its fuel type.


#### Gas Mileage (mpg)
- Gas mileage denotes fuel efficiency and is measured in miles per gallon (mpg). It tells you the distance a car can travel on a specific amount of fuel.

- Petrol engines typically have worse fuel economy than diesel engines.

- Cars that can travel long distances on little fuel are regarded as highly fuel efficient, or as having good mileage. The surge in fuel prices has made it necessary to focus more on cars with good mileage.

- The higher a car's mpg, the more efficient it is.


#### Engine Size
- Engine size is the total volume of the cylinders in the engine.
- Traditionally, larger engines produce more power because they have more space for air and fuel. They also consume more fuel and add to the weight of the car, which reduces gas mileage. However, they offer more acceleration than smaller engines.

- Smaller engines, on the other hand, consume less fuel and offer better gas mileage but less acceleration. It is important to note that the advent of turbocharged engines has made some smaller engines more powerful than some larger ones.

- In line with reality, our expectation is that cars with larger engines should be more expensive than those with smaller ones.

#### Transmission
Semi-automatic and automatic cars are generally more expensive than manual cars.

In [None]:
df.describe()

In [None]:
df['transmission'].value_counts()

In [None]:
df[df['mileage']==1]

A mileage of 1 suggests that the car is either new or has been out of use (for example, damaged) for a while. Since this car was produced in 2019, it is relatively safe to assume that it is new and has hardly been driven.

In [None]:
df[df['mileage']==323000]

Assuming the reference year is 2020, a car produced 12 years earlier (2008) would be expected to have about 144,000 miles (12 × 12,000). A mileage of 323,000 miles is far above both this expected average and the tolerable limit of 150,000 miles, which indicates that the car has most likely experienced a lot of wear and tear and should be avoided.
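
The expected-mileage reasoning above can be sketched as a small check. This is a minimal sketch, assuming a reference year of 2020, the 12,000 miles/year rule of thumb and the 150,000-mile limit used in this notebook. The constant names and the `cars` frame are illustrative; the first row mirrors the car discussed above and the other rows are made up.

```python
import pandas as pd

REFERENCE_YEAR = 2020      # assumption: year in which the analysis is carried out
MILES_PER_YEAR = 12_000    # rule-of-thumb average annual mileage
MILEAGE_LIMIT = 150_000    # tolerable limit used in this notebook

# Illustrative rows only; in the notebook this would be df[['year', 'mileage']]
cars = pd.DataFrame({"year": [2008, 2019, 2015],
                     "mileage": [323_000, 1, 40_000]})

cars["age"] = REFERENCE_YEAR - cars["year"]
cars["expected_mileage"] = cars["age"] * MILES_PER_YEAR
cars["above_expected"] = cars["mileage"] > cars["expected_mileage"]
cars["above_limit"] = cars["mileage"] >= MILEAGE_LIMIT
print(cars)
```

Only the 2008 car is flagged by both checks; the other two rows stay below their expected mileage.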

In [None]:
df[df['price']==145000]

The most expensive Audi in the dataset has an engine size of 5.2, which implies high fuel consumption; this is corroborated by its mpg value of 21. The large engine should also give it strong acceleration, although acceleration is not recorded in the dataset. The mileage on the car also implies that it has been driven a lot.

In [None]:
df[df['price']==1490]

In [None]:
df[df['mpg']==188.3]

In [None]:
high_mileage = df[df['mileage']>= 150000]
high_mileage

These high-mileage cars should be avoided, as they have reached or exceeded the tolerable limit of 150,000 miles.

In [None]:
# Checking for null data
df.info()

#### Univariate Analysis
Visualizing some of the features gives further insight into what the data conveys.

In [None]:
# sns.distplot is deprecated in recent seaborn releases; histplot is its replacement
sns.histplot(df['mileage'], kde = False, bins = 25, color = 'green')

The distribution plot of the mileage variable shows that most of the values lie between 0 and 50,000 miles; such low-mileage cars would presumably attract higher prices.

In [None]:
# histplot replaces the deprecated distplot; kde=True overlays a density estimate
sns.histplot(df['mpg'], kde = True, bins = 25, color = 'green')

The distribution appears to be positively skewed, with most of the values concentrated around the 50 mpg mark.
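
A visual impression of skew can be backed up with a number. The following is a minimal sketch using `scipy.stats.skew` (already imported at the top of this notebook) on a synthetic right-skewed sample; the sample and its parameters are assumptions for illustration, and in the notebook the same call would be applied to `df['mpg']`.

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed sample standing in for df['mpg'] (illustration only)
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=3.9, sigma=0.25, size=5_000)

sample_skew = skew(sample)   # > 0 means a longer right tail
# In a right-skewed sample the mean usually sits above the median
print(f"skewness: {sample_skew:.2f}")
print(f"mean: {sample.mean():.1f}, median: {np.median(sample):.1f}")
```

A skewness clearly above zero, together with a mean above the median, is consistent with a positively skewed distribution.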

In [None]:
# histplot replaces the deprecated distplot
sns.histplot(df['engineSize'], kde = False, bins = 15)

Most of the cars in the dataset have relatively small engines, with sizes between 1.5 and 2.

In [None]:
# Pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x = 'fuelType', data = df)

Diesel engines are more economical than their petrol counterparts but tend to be more expensive, which makes them more suitable for high-mileage and highway drivers. Hybrid cars are more fuel efficient but often offer less acceleration.

### Multivariate Analysis
Comparison between two features

In [None]:
high_mileage

In [None]:
sns.stripplot(y = 'price', x = 'transmission', data = df)

As stated in the feature definitions above, manual cars are relatively cheaper than semi-automatic and automatic cars.

In [None]:
sns.stripplot(y = 'mpg', x = 'fuelType', data = df)

In [None]:
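# numeric_only=True skips text columns such as model and transmission,
# which would otherwise raise an error in pandas 2.x
df.groupby('fuelType').mean(numeric_only=True)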

Considering the plot and the grouped means above, we can conclude that hybrid cars typically have better fuel efficiency (mpg) than cars that run on petrol or diesel. Diesel cars also appear to be slightly more efficient than petrol cars.
However, prices of hybrid cars tend to be higher than those of the other fuel types.
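
Group means can be misleading when one group is small or contains outliers, so it can help to show the median and the group size next to the mean. A minimal sketch on made-up rows (the `toy` frame and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up rows for illustration only; in the notebook this would be df
toy = pd.DataFrame({
    "fuelType": ["Petrol", "Petrol", "Diesel", "Diesel", "Hybrid"],
    "mpg": [40.0, 44.0, 55.0, 57.0, 120.0],
    "price": [20_000, 22_000, 24_000, 26_000, 30_000],
})

# Restrict to the columns of interest and compute several statistics at once
summary = (toy.groupby("fuelType")[["mpg", "price"]]
              .agg(["mean", "median", "count"]))
print(summary)
```

The `count` column makes it easy to see how many cars each average is based on before drawing conclusions about a fuel type.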

In [None]:
plt.scatter(df['price'], df['mileage'])

In [None]:
plt.scatter(df['mpg'], df['engineSize'])

In [None]:
plt.scatter(df['mpg'], df['price'])

In [None]:
plt.scatter(df['mileage'], df['mpg'])

The comparison above suggests that cars that have racked up high mileage tend to have lower fuel efficiency (mpg).
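
A scatter plot gives a qualitative impression; a correlation coefficient puts a number on it. The sketch below uses synthetic data in which mpg drifts down as mileage goes up (the data and its parameters are assumptions for illustration); in the notebook the same call would be `df['mileage'].corr(df['mpg'])`.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only: mpg decreases slightly with mileage
rng = np.random.default_rng(1)
mileage = rng.uniform(0, 150_000, size=500)
mpg = 60 - 0.0001 * mileage + rng.normal(0, 3, size=500)
toy = pd.DataFrame({"mileage": mileage, "mpg": mpg})

# Pearson correlation: -1 (perfect negative) to +1 (perfect positive)
pearson = toy["mileage"].corr(toy["mpg"])
print(f"Pearson correlation between mileage and mpg: {pearson:.2f}")
```

A negative value supports the reading that higher mileage goes with lower mpg, while a value near zero would mean the scatter plot shows little linear relationship.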

In [None]:
sns.pairplot(df)

In [None]:
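# numeric_only=True restricts the correlation matrix to numeric columns
# (required in pandas 2.x, where text columns raise an error)
df_corr = df.corr(numeric_only=True)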
df_corr

In [None]:
plt.figure(figsize=(8,6))
sns.set_context('paper', font_scale=1.4)

sns.heatmap(df_corr, annot=True, cmap='Blues')

#### BUILDING THE MODELS

In [None]:
X = df.drop(columns=['model','price'])

Y = df['price'].values

##### Encoding the categorical data

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

In [None]:
ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [1,3])], remainder = 'passthrough')
X = np.array(ct.fit_transform(X))

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 1)

#### Multiple Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
linear = LinearRegression()
linear.fit(X_train, y_train)

In [None]:
y_pred = linear.predict(X_test)

In [None]:
r2_score(y_test, y_pred)

Using Multiple Linear Regression, we get an R² score of about 0.809 on the test set, i.e. the model explains roughly 80.9% of the variance in price.
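
The value returned by `r2_score` is the coefficient of determination, not a classification accuracy. A minimal sketch of how it is computed, alongside two error metrics expressed in price units, using made-up numbers (`y_true` and `y_hat` are illustrative and are not predictions from this notebook):

```python
import numpy as np

# Made-up values for illustration only (not model output from this notebook)
y_true = np.array([10_000.0, 20_000.0, 30_000.0, 40_000.0])
y_hat = np.array([12_000.0, 18_000.0, 33_000.0, 37_000.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # variation left unexplained by the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2 = 1 - ss_res / ss_tot

mae = np.mean(np.abs(y_true - y_hat))           # average absolute error, in price units
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))  # penalises large errors more heavily
print(f"R^2 = {r2:.3f}, MAE = {mae:.0f}, RMSE = {rmse:.0f}")
# prints: R^2 = 0.948, MAE = 2500, RMSE = 2550
```

In the notebook, `mean_absolute_error(y_test, y_pred)` and `mean_squared_error(y_test, y_pred)` (both imported earlier) would give the corresponding figures for the fitted model.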

#### Decision Tree Regression

In [None]:
from sklearn.tree import DecisionTreeRegressor
Desc_regres = DecisionTreeRegressor(random_state = 0)
Desc_regres.fit(X_train,y_train)

In [None]:
y_desc_pred = Desc_regres.predict(X_test)

In [None]:
r2_score(y_test, y_desc_pred)

With the Decision Tree Regressor, we get an R² score of about 0.901 on the test set.

#### Random Forest Regression

In [None]:
from sklearn.ensemble import RandomForestRegressor
Rand_regressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
Rand_regressor.fit(X_train, y_train)

In [None]:
y_rand_pred = Rand_regressor.predict(X_test)

In [None]:
r2_score(y_test, y_rand_pred)

**Random Forest appears to be the best of the three models for predicting prices of Audi cars, with an R² score of about 0.93 on the test set**
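
The ranking above rests on a single train/test split. A hedged sketch of how the three models could be compared with 5-fold cross-validation instead is shown below. It runs on synthetic data from `make_regression` so that it is self-contained; `X_demo`, `y_demo` and the fold count are assumptions, and the scores it prints say nothing about the Audi data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data for illustration only; swap in X and Y from above
X_demo, y_demo = make_regression(n_samples=300, n_features=5,
                                 noise=10.0, random_state=0)

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=0),
    "forest": RandomForestRegressor(n_estimators=10, random_state=0),
}

cv_scores = {}
for name, model in models.items():
    # Each model is refit on 4 folds and scored on the held-out fold, 5 times
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")
    cv_scores[name] = scores.mean()
    print(f"{name}: mean R^2 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Averaging over several folds makes the comparison less dependent on which rows happen to land in the test set.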