# Belarus Used Cars Prices

Hello, I am using this data set to build my first linear regression in scikit-learn. I would really appreciate any feedback if I have made an error, small or serious. I am attempting to go through the entire machine learning workflow. I am looking forward to getting started on Kaggle, and I hope this notebook provides at least some value to beginners like myself.

## Data Source

In [None]:
# Uncomment the line below and run the cell to open the data source

import webbrowser
#webbrowser.open('https://www.kaggle.com/slavapasedko/belarus-used-cars-prices')

# Data Description

**Context** <br>
This file represents the market of cars on sale. <br>

**Content** <br>
This data was collected on the Internet and represents the car market. The dataset was collected at the dawn of 2019. <br>

**Columns**
1. make - machine firm <br>
2. model - model :) <br>
3. price USD - price in dollars (target variable) <br>
4. year - production year <br>
5. condition - represents the condition at the sale moment (with mileage, for parts, etc) <br>
6. mileage - mileage in kilometers <br>
7. fuel_type - type of the fuel (electro, petrol, diesel) <br>
8. volume of the engine <br>
9. color <br>
10. transmission <br>
11. drive unit <br>
12. segment (this feature was collected manually, so it could be wrong) <br>

**Inspiration** <br>
This dataset is suited to practicing linear regression skills, and the data can also be visualized.

# Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Data

## Read in Data

In [None]:
df = pd.read_csv('../input/belarus-used-cars-prices/cars.csv')

## Data Information

First check the information of the data set. See what each feature data type is and whether or not it has missing values.

In [None]:
df.info()

After looking through each feature, the data types appear to be correct. Some of the features have missing values, and we will deal with them later on.

## Peek at the Data

Let's view the head and the tail of the data set

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df['transmission'].unique()

## Descriptive Statistics

Let's also get some summary statistics of our data.

In [None]:
df.describe(include='all')

A lot of the features are in fact categorical, so we will need to convert them.

# Missing Data

## Explore Missing Data

Let's first see which features have missing data and how many values are missing in each

In [None]:
missing_data = df.isnull().sum() 
missing_data

We will start by removing the features that have more than 5% of their values missing

In [None]:
na_index_filter = missing_data[missing_data / len(df) <= 0.05].index

Now we can use this index to keep the columns we want. This will only drop the segment column.

In [None]:
df1 = df[na_index_filter]
df1.info()

## Impute Data / Remove rows

The volume feature has less than 1% of its values missing, so we will remove those rows from the data set.

In [None]:
(df1['volume(cm3)'].isnull().sum() / len(df)) * 100

The drive_unit feature has 3.4% of its values missing. We can try to remove those rows or impute them if it makes sense.

In [None]:
(df1['drive_unit'].isnull().sum() / len(df)) * 100

Let's check the unique values.

In [None]:
df1['drive_unit'].unique()

Let's remove all of these rows for now, as doing so removes less than 5% of the total rows of the data set. If we decide to later, we can come back and impute the volume feature with the mean/median and the drive unit feature with the mode.

In [None]:
df1 = df1.dropna(axis=0)
df1.info()
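
If we do come back to impute instead of dropping rows, a minimal sketch could look like the following. It runs on a small made-up frame (`toy`, a hypothetical stand-in for the real data) and assumes the median for the numeric column and the mode for the text column, as suggested above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the two columns with missing values
toy = pd.DataFrame({
    'volume(cm3)': [1600.0, 2000.0, np.nan, 1800.0],
    'drive_unit': ['front-wheel drive', None, 'front-wheel drive', 'rear drive'],
})

# Numeric column: fill with the median of the observed values
volume_median = toy['volume(cm3)'].median()
# Text column: fill with the most frequent level (mode() returns a Series)
drive_mode = toy['drive_unit'].mode()[0]

# fillna accepts a dict that maps each column name to its fill value
toy_imputed = toy.fillna({'volume(cm3)': volume_median, 'drive_unit': drive_mode})
print(toy_imputed)
```

In a full workflow the median and mode would be computed on the training rows only, so that nothing from the test set leaks into the fill values.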

# Data Type Conversion

## Convert Year to Object

In [None]:
df1['year']= df1['year'].astype('object')

### Categorical Variables

In [None]:
text_cols = df1.select_dtypes(include=['object']).columns
text_cols

Along with year, which we just converted, the text columns make up the categorical variables in our data set. Let's see how many unique values each one has.

In [None]:
for cat in text_cols:
    print(cat + ':' + str(len(df1[cat].unique())))

### Convert Text Columns to Category Type

In [None]:
for col in text_cols:
    df1[col]= df1[col].astype('category')

Now let's make sure each text column was converted to the category type

In [None]:
df1.dtypes

A view of the unique category codes in each column

In [None]:
for col in text_cols:    
    print(col + ':' + str(df1[col].cat.codes.unique()))

In [None]:
df1.info()

In [None]:
text_cols

# Data Transformation

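## Convert mileage(kilometers) to Miles and Take the Square Root

In [None]:
# 1 kilometer = 0.621371 miles; the new column holds the square root of the mileage in miles
df1['mileage(miles)_sqrt'] = np.sqrt(df1['mileage(kilometers)']*0.621371)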

In [None]:
df1.head()
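
One common reason to take a square root is to pull in a long right tail. The sketch below checks that idea on a handful of made-up mileage values (hypothetical, not taken from the data set) by comparing skewness before and after the transform. Whether this was the motivation here is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical mileage values in kilometers, with one very large value
km = pd.Series([5_000, 20_000, 60_000, 120_000, 250_000, 900_000], dtype=float)

miles = km * 0.621371        # same conversion factor as in the notebook
miles_sqrt = np.sqrt(miles)  # square-root transform

# A skewness closer to 0 means a more symmetric distribution
print('skew before:', miles.skew())
print('skew after: ', miles_sqrt.skew())
```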

# Feature Selection

## Numerical Features

In [None]:
num_features = df1.select_dtypes(include='number').columns.drop('priceUSD')
num_features

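#### Correlation Matrix

We can make a correlation table and a heat map from the numeric columns.

In [None]:
# Restrict to numeric columns; the category columns cannot be correlated directly
CORR_MAP = df1.select_dtypes(include='number').corr()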
CORR_MAP

#### Heat Map

In [None]:
plt.figure(figsize=(10, 7))
sns.heatmap(CORR_MAP, annot=True)

#### Scatter Plots

#### Numerical Feature Selection

In [None]:
price_corr = df1.select_dtypes(include='number').corr()['priceUSD'].sort_values().drop('priceUSD')
price_corr

Let's keep only the features whose correlation with price has an absolute value greater than 0.25

In [None]:
feature_drop = price_corr[np.abs(price_corr) <= 0.25].index
feature_drop

In [None]:
df1 = df1.drop(feature_drop, axis=1)
df1.head()

# Categorical Data

Let's collect the columns that now have the categorical type

In [None]:
cat_features = df1.select_dtypes(include=['category']).columns
cat_features

Let's look at some data visualizations between our categorical variables and sales price
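
As a starting point for such a view, the sketch below summarises price per category level on a small made-up frame (`toy` is hypothetical). The commented-out line shows how the same columns could be passed to a seaborn box plot.

```python
import pandas as pd

# Hypothetical toy data: one categorical column and the target
toy = pd.DataFrame({
    'transmission': ['auto', 'auto', 'mechanics', 'mechanics', 'mechanics'],
    'priceUSD': [9000, 11000, 3000, 4000, 5000],
})

# Median price per level, highest first
median_price = (
    toy.groupby('transmission')['priceUSD']
    .median()
    .sort_values(ascending=False)
)
print(median_price)

# The same relationship as a plot (requires seaborn):
# sns.boxplot(x='transmission', y='priceUSD', data=toy)
```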

# Dummy Encoding

Now we can dummy encode our categorical data

In [None]:
for text_col in text_cols:
    # Prefix each dummy with its source column so names from different columns cannot collide
    col_dummies = pd.get_dummies(df1[text_col], prefix=text_col, dtype=int)
    df1 = pd.concat([df1, col_dummies], axis=1)
    del df1[text_col]

We now have a new data frame where each level is a new column and the values are all 1's and 0's

In [None]:
df1.head()
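
To see what dummy encoding does on a small scale, here is a minimal sketch on a made-up frame (`toy` is hypothetical). It encodes two text columns in one call and shows the resulting column names.

```python
import pandas as pd

# Hypothetical toy data with two text columns and the target
toy = pd.DataFrame({
    'transmission': ['auto', 'mechanics', 'auto'],
    'fuel_type': ['petrol', 'diesel', 'petrol'],
    'priceUSD': [5000, 3000, 7000],
})

# One 0/1 column per level, named <column>_<level>
encoded = pd.get_dummies(toy, columns=['transmission', 'fuel_type'], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

`pd.get_dummies` also accepts `drop_first=True`, which drops one level per column. That is often used with linear regression because the full set of dummies for a column always sums to 1.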

In [None]:
corr_price = df1.corr()['priceUSD'].sort_values()

In [None]:
corr_price = corr_price.drop('priceUSD')

In [None]:
other_index = corr_price[np.abs(corr_price) > 0.15].index                            

In [None]:
# Copy so that adding the target column does not write into a view of df1
new_df = df1[other_index].copy()
new_df['priceUSD'] = df1['priceUSD']

In [None]:
new_df.head()

# Training and Test Set

## Training and Test Sets

Break the dataset into a training set, where we can train our model, and a test set, to see how well our model predicts the target variable on unseen data.

In [None]:
# Use an integer row count so it can be used directly as a slice position
train_rows = int(len(df1) * .75)
test_rows = len(df1) - train_rows
print(train_rows)
print(test_rows)
train_rows + test_rows == len(df1)

In [None]:
train_df = df1.iloc[:train_rows]
test_df = df1.iloc[train_rows:]
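
Slicing by position keeps the rows in file order. If the file happens to be sorted in some way, a shuffled split may be more representative. A minimal sketch of that alternative with scikit-learn's `train_test_split`, on a made-up frame (`toy` and the `random_state` value are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows with one feature and the target
toy = pd.DataFrame({'x': np.arange(20), 'priceUSD': np.arange(20) * 100})

# 75% train / 25% test; rows are shuffled before splitting,
# and random_state makes the shuffle reproducible
train_toy, test_toy = train_test_split(toy, test_size=0.25, random_state=42)
print(len(train_toy), len(test_toy))
```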

## Features and Target Variables

We can create variables that hold our features and target variable

In [None]:
features = train_df.drop('priceUSD', axis=1).columns
target = 'priceUSD'

# Linear Regression 

## Import Packages from scikit-learn

Import classes to run regression and calculate error metric

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

## First Model with few categorical features

## Model Building

Instantiate model

In [None]:
lm = LinearRegression()

Train the model on the train dataset

In [None]:
lm.fit(train_df[features], train_df[target])

## Predictions

Make predictions on the test set

In [None]:
test_predictions = lm.predict(test_df[features])

Lets make a series that contains the actual target values from the test set.

In [None]:
actual_target = test_df[target].reset_index(drop=True)
actual_target.head()

Now we can concat the actual and predicted values.

In [None]:
pred_df = pd.concat([actual_target, pd.Series(test_predictions)], axis=1, ignore_index=True)
pred_df.columns = ['Actual', 'Predicted']
pred_df.head()

## Visual of the predictions vs the actuals

Visual of the first 50 predictions

In [None]:
plot_df = pred_df.head(50)
plot_df.plot(kind='bar', figsize=(16, 10))
plt.grid(which='major', linestyle='-', linewidth=0.5, color='green')
plt.grid(which='minor', linestyle=':', linewidth=0.5, color='black')
plt.show()

## RMSE

The RMSE value is very high. This is not unexpected, as our features were not highly correlated with our target.

In [None]:
rmse = np.sqrt(mean_squared_error(test_df[target], test_predictions))
rmse
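
To judge whether an RMSE is high, it helps to compare it with a simple baseline. The sketch below computes RMSE by hand on made-up numbers (the `actual` and `predicted` arrays are hypothetical) and compares it with a baseline that always predicts the mean price.

```python
import numpy as np

# Hypothetical actual and predicted prices in USD
actual = np.array([3000.0, 5000.0, 8000.0, 12000.0])
predicted = np.array([3500.0, 4500.0, 9000.0, 10000.0])

def rmse_of(y_true, y_pred):
    """Root mean squared error: sqrt of the mean squared difference."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

model_rmse = rmse_of(actual, predicted)
# Baseline: predict the mean price for every car
baseline_rmse = rmse_of(actual, np.full_like(actual, actual.mean()))

print('model RMSE:   ', model_rmse)
print('baseline RMSE:', baseline_rmse)
```

In the notebook itself, the baseline mean would come from the training set, and a model is only useful if its test RMSE is clearly below that baseline.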

## Second Model with more categorical features

## Model Building

Instantiate model

In [None]:
lm = LinearRegression()

Train the model on the train dataset

In [None]:
lm.fit(train_df[features], train_df[target])

## Predictions

Make predictions on the test set

In [None]:
test_predictions = lm.predict(test_df[features])

Lets make a series that contains the actual target values from the test set.

In [None]:
actual_target = test_df[target].reset_index(drop=True)
actual_target.head()

Now we can concat the actual and predicted values.

In [None]:
pred_df = pd.concat([actual_target, pd.Series(test_predictions)], axis=1, ignore_index=True)
pred_df.columns = ['Actual', 'Predicted']
pred_df.head()

## Visual of the predictions vs the actuals

Visual of the first 50 predictions

In [None]:
plot_df = pred_df.head(50)
plot_df.plot(kind='bar', figsize=(16, 10))
plt.grid(which='major', linestyle='-', linewidth=0.5, color='green')
plt.grid(which='minor', linestyle=':', linewidth=0.5, color='black')
plt.show()

## RMSE

The RMSE value is very high. This is not unexpected, as our features were not highly correlated with our target.

In [None]:
rmse = np.sqrt(mean_squared_error(test_df[target], test_predictions))
rmse

# Conclusion

This was a good exercise in going through the machine learning workflow. I walked away with more questions than I expected, so I have much to learn. I made a lot of decisions that I was unsure about, especially about how to deal with categorical features.