# This is a beginner-friendly notebook that performs exploratory data analysis using graph visualizations.
# We use a linear regression model to predict car prices, measure the error with the mean absolute percentage error, and then try to reduce it by changing the features we feed to the model (feature selection).

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df_cardekho = pd.read_csv("/kaggle/input/vehicle-dataset-from-cardekho/CAR DETAILS FROM CAR DEKHO.csv")
df_cardata = pd.read_csv("/kaggle/input/vehicle-dataset-from-cardekho/car data.csv")
df_cardetails = pd.read_csv("/kaggle/input/vehicle-dataset-from-cardekho/Car details v3.csv")

In [None]:
df_cardekho.info()
df_cardekho

There are no null values in the data from `CAR DETAILS FROM CAR DEKHO.csv`.
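
A per-column null count makes this check explicit. The sketch below uses a small hypothetical frame (`toy`, with made-up rows) rather than the real data, so the values are illustrative only; on the notebook's data the same call would be `df_cardekho.isnull().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_cardekho (values are made up)
toy = pd.DataFrame({
    "name": ["Car A", "Car B", "Car C"],
    "selling_price": [60000, 450000, np.nan],  # one missing value on purpose
    "km_driven": [70000, 30000, 50000],
})

# Count missing values per column; all zeros means no nulls
null_counts = toy.isnull().sum()
print(null_counts)
```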

In [None]:
df_cardekho.describe()

The data covers the years 1992 to 2020.

# **Exploring/Visualising the data**  

Let's check how the seller type influences the selling price

In [None]:
fig, (axis1,axis2) = plt.subplots(1, 2, figsize = (20,5))
sns.countplot(x="seller_type", data=df_cardekho, ax=axis1)
sns.barplot(x="seller_type", y="selling_price",data=df_cardekho, ax=axis2 )

**Individual sellers are the most numerous, but Trustmark dealers sell cars at the highest average price**

**Graph to visualize the effect of the field "owner" on the selling price**

In [None]:
fig, (axis1,axis2)=plt.subplots(1,2,figsize=(20,5))
sns.countplot(x="owner",data=df_cardekho,ax=axis1)
sns.barplot(x="owner",y="selling_price",data=df_cardekho,ax=axis2)

**The number of cars in each owner category and the prices at which those cars sell follow a similar pattern**

The graphs below help us understand how the transmission type affects the selling price

In [None]:
fig, (axis1,axis2)=plt.subplots(1,2,figsize=(20,5))
sns.countplot(x="transmission",data=df_cardekho,ax=axis1)
sns.barplot(x="transmission",y="selling_price",data=df_cardekho,ax=axis2)

**More manual cars are sold than automatic ones, but automatic cars sell at a higher price**

In [None]:
sns.scatterplot(data=df_cardekho,x="km_driven",y="selling_price")

**Cars which are less driven sell for a higher price**

In [None]:
df_cardekho["Age"] = 2020-df_cardekho["year"]
sns.lineplot(data=df_cardekho,x="Age",y="selling_price")

**Newer cars sell for higher prices**

In [None]:
sns.barplot(data=df_cardekho,x="fuel",y="selling_price")


Diesel cars have the highest selling price, followed by petrol.

In [None]:
sns.countplot(data=df_cardekho,x="fuel")

**Most cars listed for sale run on either petrol or diesel.**

In [None]:
df_seller_owner = df_cardekho.groupby(by=["seller_type","owner","transmission"])
df_seller_owner.count().sort_values(by="selling_price", ascending=False).plot(kind="bar", y="selling_price")

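**The above graph shows how many listings fall into each combination of "seller_type", "owner" and "transmission". Because it uses `count()`, it reflects the number of listings in each group rather than the selling price itself**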
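
Note that `count()` on a grouped frame returns the number of rows per group, while `mean()` returns the average value. The sketch below contrasts the two on a small hypothetical frame (`toy`, with made-up rows and only two grouping columns), so it is an illustration under those assumptions rather than a result from the real data.

```python
import pandas as pd

# Hypothetical toy frame (values are made up for illustration)
toy = pd.DataFrame({
    "seller_type": ["Individual", "Individual", "Dealer", "Dealer"],
    "transmission": ["Manual", "Manual", "Automatic", "Manual"],
    "selling_price": [100000, 200000, 900000, 300000],
})

grouped = toy.groupby(["seller_type", "transmission"])["selling_price"]

listing_counts = grouped.count()  # how many listings fall in each group
mean_prices = grouped.mean()      # average selling price within each group

print(listing_counts)
print(mean_prices)
```

Plotting `mean_prices` instead of `listing_counts` would be one way to see how the groups differ in price.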

# **Most of our fields are categorical, so we have to convert them to numeric data before working with them.**

In [None]:
df_cardekho = pd.get_dummies(df_cardekho,columns=['fuel','transmission','seller_type',"owner"],drop_first=True)
df_cardekho.info()

In [None]:
df_cardekho.head()

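**We have broken down individual categorical fields into numeric data. For example, the field "fuel" is now split into the fields fuel_Diesel, fuel_Electric, fuel_LPG and fuel_Petrol. Because we pass `drop_first=True`, the first category (CNG) gets no column of its own and is represented by all the other fuel columns being 0.**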
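
To see which columns `pd.get_dummies` produces when `drop_first=True`, here is a minimal sketch on a hypothetical four-row frame (`toy`, made-up values). The categories are sorted and the first one is dropped.

```python
import pandas as pd

# Hypothetical toy frame (values are made up for illustration)
toy = pd.DataFrame({
    "fuel": ["Petrol", "Diesel", "CNG", "LPG"],
    "selling_price": [1, 2, 3, 4],
})

# One column per category, except the first category in sorted order (CNG)
encoded = pd.get_dummies(toy, columns=["fuel"], drop_first=True)
print(list(encoded.columns))
```

A row whose fuel columns are all 0 (or False) is therefore a CNG car.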

# **Linear Regression Model**

**Before fitting the linear regression model, we split our data into training and test sets**
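
`train_test_split` shuffles the rows randomly, so each run produces a different split and a slightly different error. Passing `random_state` makes the split repeatable. The sketch below shows this on hypothetical arrays (`X_toy`, `y_toy`); the seed value 42 is an arbitrary choice and not something this notebook uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Same seed, same split: the two calls return identical partitions
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```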

In [None]:
#importing requirements
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

In [None]:
#splitting the data
(Y, X) = (df_cardekho['selling_price'].values, df_cardekho.drop(['selling_price','name'], axis = 1))
(X_train, X_test, Y_train, Y_test) = train_test_split(X, Y, test_size = 0.3)


**Training the model and calculating percentage of error**
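
The error measure used here is the mean absolute percentage error: the average of |actual - predicted| / actual, multiplied by 100. The sketch below works through it on three made-up prices. The helper name `mape` and the `ravel()` calls are additions for illustration: flattening both inputs guards against accidental broadcasting when one array has shape `(n,)` and the other `(n, 1)`.

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    # Flatten both inputs so (n,) and (n, 1) arrays cannot broadcast to (n, n)
    actual = np.asarray(actual, dtype=float).ravel()
    predicted = np.asarray(predicted, dtype=float).ravel()
    return np.mean(np.abs((actual - predicted) / actual)) * 100

# Made-up prices: relative errors are 10%, 10% and 0%
actual = [100000, 200000, 400000]
predicted = [110000, 180000, 400000]

error_pct = mape(actual, predicted)
print(error_pct)  # average of 10, 10 and 0 percent, about 6.67
```

Because each error is divided by the actual price, cheap cars with large absolute misses can dominate this metric.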

In [None]:
lr = LinearRegression()
lr.fit(X_train,Y_train)
predicted_prices = lr.predict(X_test)

def mean_absolute_percentage_error(Y_test,predicted_prices): 
    return np.mean(np.abs((Y_test-predicted_prices) / Y_test)) * 100
mean_absolute_percentage_error(Y_test,predicted_prices)


**As the error percentage is quite high, let us try removing some columns before feeding the data into our regression model**

In [None]:
#splitting the data
(B, A) = (df_cardekho['selling_price'].values, df_cardekho.drop(['selling_price','name','fuel_LPG','fuel_Electric', 'year'], axis = 1))
B = B.reshape((-1,1))
(A_train, A_test, B_train, B_test) = train_test_split(A, B, test_size = 0.3)


In [None]:
lr = LinearRegression()
lr.fit(A_train,B_train)
predicted_prices = lr.predict(A_test)

def mean_absolute_percentage_error(B_test,predicted_prices): 
    return np.mean(np.abs((B_test-predicted_prices) / B_test)) * 100
mean_absolute_percentage_error(B_test,predicted_prices)

We get a slightly better error percentage by removing fields like 'fuel_LPG', 'fuel_Electric' and 'year'.

**Let us try to improve the accuracy of our model further through feature selection, based on the EDA we did earlier**
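
When comparing feature sets, it helps to evaluate each one on the same train/test split; otherwise part of the difference in error comes from the random split rather than from the features. The sketch below is one way to do that. It runs on synthetic data, and the helper `evaluate`, the column `noise` and the seed values are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: price depends on Age only; "noise" is an irrelevant column
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Age": rng.integers(0, 20, n),
    "noise": rng.normal(size=n),
})
toy["selling_price"] = 900000 - 30000 * toy["Age"] + rng.normal(0, 20000, n)

def evaluate(features, data, seed=42):
    """Fit a linear regression on the given columns and return the test MAPE in percent."""
    X = data[features]
    y = data["selling_price"].to_numpy()
    # A fixed seed gives every feature set the same train/test rows
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    pred = LinearRegression().fit(X_tr, y_tr).predict(X_te)
    return np.mean(np.abs((y_te - pred) / y_te)) * 100

feature_sets = {"age_only": ["Age"], "noise_only": ["noise"], "both": ["Age", "noise"]}
results = {label: evaluate(cols, toy) for label, cols in feature_sets.items()}
print(results)
```

On this synthetic data the informative column should give a clearly lower error than the irrelevant one; on the real data the differences may be much smaller.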

In [None]:
#splitting the data
(B, A) = (df_cardekho['selling_price'].values, df_cardekho.drop(['selling_price','km_driven','owner_Third Owner','owner_Fourth & Above Owner','fuel_Petrol','fuel_Diesel','name','fuel_LPG','fuel_Electric', 'year'], axis = 1))
B = B.reshape((-1,1))
(A_train, A_test, B_train, B_test) = train_test_split(A, B, test_size = 0.3)

In [None]:
lr = LinearRegression()
lr.fit(A_train,B_train)
predicted_prices = lr.predict(A_test)

def mean_absolute_percentage_error(B_test,predicted_prices): 
    return np.mean(np.abs((B_test-predicted_prices) / B_test)) * 100
mean_absolute_percentage_error(B_test,predicted_prices)

**By using feature selection we brought our error percentage down from 74% to 65%. Let's see if changing the model reduces it further**