****
**The Ai Academy**
****

### Regression model.
* The goal is to predict an actual numerical value: the price of a used car.
* ABC Motors owns an e-commerce platform that acts as a middleman between people looking to buy and sell pre-owned cars.
* They have a large amount of data from past sales, both through their platform and from other sources. Their goal is simple: to make sales happen quickly.
* If the price is appropriate, cars will sell faster. So, ABC Motors wants to develop an algorithm to predict the price of cars based on various attributes of the car.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

* ABC Motors has data on around **50,000 cars** that have been sold or processed in various ways. There are 19 variables associated with this problem.
* One of these variables is our target, which is the price of the car. The other variables contain information that will help us predict the car's price. 
* One variable is called **"dateCrawled,"** which is of date type. It indicates when the ad for the car was first crawled, that is, when the listing was first picked up online. Essentially, it tells us when people could first start looking at this car.
* Another variable is **"name,"** and it's a bit more complex. This string can include the car name, brand, model, and more. It's a composite string, and the order of the information isn't consistent across the dataset.
* The **"seller"** variable indicates whether the seller is a private individual or a commercial entity. Then there's the "offer type," which tells us if the buyer made an offer after seeing a specific car or if the price is what the seller asked for initially.
* The **"price"** variable is our outcome variable: the listed price of the car. This is what we're trying to predict. ABC Motors also conducts specific studies, so each ad is classified as either "test" or "control" (the `abtest` column), and this is stored as a string.
* **"Vehicle type"** is another string variable, indicating if the car is a cabrio, SUV, coupe, or one of five other types. The "year of registration" is an integer that tells us when the car was first registered.
* **"Gearbox"** can be manual or automatic. We also have "power" (the `powerPS` column), an integer representing the car's power, which can take a wide range of values.
* **Model:** This refers to the specific model type of the car. For example, if it's a Hyundai, is it an i10, i20, etc.?

* **Month of Registration:** This tells us the month in which the car was first registered.

* **Fuel Type:** This variable indicates whether the car runs on petrol, diesel, or one of five other fuel types.

* **Brand:** What brand is the car? Is it a BMW, Mercedes, or something else?
* **Not Repaired Damage:** This string variable indicates if the car has any unrepaired damage. If the value is "yes," there has been damage that hasn't been repaired. If it's "no," the damage has been repaired and rectified. You can see how this might significantly affect the car's price.
* **Date Created:** The date when the ad was created on the ABC Motors platform.
* **Postal Code:** The postal code of the seller, which can provide location-specific information.
* **Last Seen:** This indicates when the ad was last seen online by a crawler, giving us an idea of the most recent activity and interest in the car.




In [5]:
df = pd.read_csv('Usedcars.csv')

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50001 entries, 0 to 50000
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50001 non-null  object
 1   name                 50001 non-null  object
 2   seller               50001 non-null  object
 3   offerType            50001 non-null  object
 4   price                50001 non-null  int64 
 5   abtest               50001 non-null  object
 6   vehicleType          44813 non-null  object
 7   yearOfRegistration   50001 non-null  int64 
 8   gearbox              47177 non-null  object
 9   powerPS              50001 non-null  int64 
 10  model                47243 non-null  object
 11  kilometer            50001 non-null  int64 
 12  monthOfRegistration  50001 non-null  int64 
 13  fuelType             45498 non-null  object
 14  brand                50001 non-null  object
 15  notRepairedDamage    40285 non-null  object
 16  date
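
The `df.info()` output above shows fewer non-null entries for `vehicleType`, `gearbox`, `model`, `fuelType`, and `notRepairedDamage` than for the other columns; `notRepairedDamage`, for example, has 40285 of 50001, so roughly 19% of its values are missing. A minimal sketch of how those gaps could be counted per column is shown below. The `sample` frame is a tiny stand-in built from two columns of the first five rows, not the real data.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the real data: two columns from the first five rows.
sample = pd.DataFrame({
    "brand": ["bmw", "volvo", "volkswagen", "seat", "volvo"],
    "notRepairedDamage": [np.nan, "no", np.nan, "no", "no"],
})

# Number of missing values per column, largest first.
missing = sample.isnull().sum().sort_values(ascending=False)

# Share of rows that are missing in each column.
missing_share = (missing / len(sample)).round(2)

print(pd.DataFrame({"missing": missing, "share": missing_share}))
```

On the full dataset the same two lines would be applied to `df` instead of `sample`.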

In [6]:
df.head()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,postalCode,lastSeen
0,30/03/2016 13:51,Zu_verkaufen,private,offer,4450,test,limousine,2003,manual,150,3er,150000,3,diesel,bmw,,30/03/2016 0:00,20257,7/4/2016 4:44
1,7/3/2016 9:54,Volvo_XC90_2.4D_Summum,private,offer,13299,control,suv,2005,manual,163,xc_reihe,150000,6,diesel,volvo,no,7/3/2016 0:00,88045,26/03/2016 13:17
2,1/4/2016 0:57,Volkswagen_Touran,private,offer,3200,test,bus,2003,manual,101,touran,150000,11,diesel,volkswagen,,31/03/2016 0:00,27449,1/4/2016 8:40
3,19/03/2016 17:50,Seat_Ibiza_1.4_16V_Reference,private,offer,4500,control,small car,2006,manual,86,ibiza,60000,12,petrol,seat,no,19/03/2016 0:00,34537,7/4/2016 4:44
4,16/03/2016 14:51,Volvo_XC90_D5_Aut._RDesign_R_Design_AWD_GSHD_S...,private,offer,18750,test,suv,2008,automatic,185,xc_reihe,150000,11,diesel,volvo,no,16/03/2016 0:00,55270,1/4/2016 23:18
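
The date columns in this preview are day-first strings (for example `30/03/2016 13:51`), and `df.info()` reports them as `object`. The sketch below shows one way they might be parsed, using values copied from the rows above. The `days_online` column is a hypothetical derived feature, not something the dataset provides.

```python
import pandas as pd

# Date strings copied from the first three rows of the preview (day/month/year).
ads = pd.DataFrame({
    "dateCreated": ["30/03/2016 0:00", "7/3/2016 0:00", "31/03/2016 0:00"],
    "lastSeen": ["7/4/2016 4:44", "26/03/2016 13:17", "1/4/2016 8:40"],
})

# An explicit day-first format makes sure 7/4/2016 is read as 7 April, not 4 July.
DATE_FORMAT = "%d/%m/%Y %H:%M"
for col in ["dateCreated", "lastSeen"]:
    ads[col] = pd.to_datetime(ads[col], format=DATE_FORMAT)

# Hypothetical feature: whole days between ad creation and the last crawl.
ads["days_online"] = (ads["lastSeen"] - ads["dateCreated"]).dt.days

print(ads)
```

A feature like this could serve as a rough proxy for how long a car stayed listed, which ties back to the goal of selling quickly.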


*****
* Let's break down the variables and understand their potential impact on predicting the price of pre-owned cars.
* Think about it this way: each variable we have could potentially influence the car's price. For instance, if a car isn't getting much attention (i.e., not many people are looking at the ad), it might be priced too high, and we might need to adjust the price to attract more buyers.
* We can group these variables into different categories:
### Vehicle Specifications
- **Gearbox**: Whether the car has a manual or automatic transmission.
- **Power**: The horsepower of the car.
- **Fuel Type**: Whether the car runs on petrol, diesel, or another fuel type.

### Condition of the Car
- **Not Repaired Damage**: Indicates if the car has unrepaired damage.
- **Kilometres**: The total distance the car has been driven, which can give us an idea of how old and worn out the car might be.

### Seller Details
- **Seller**: Whether the seller is a private individual or a commercial entity.

### Registration Details
- **Year of Registration**: The year the car was first registered.
- **Month of Registration**: The month the car was first registered.

### Car Make and Model
- **Model**: The specific model type of the car (e.g., Hyundai i10, i20).
- **Brand**: The brand of the car (e.g., BMW, Mercedes).

### Advertisement Details
- **Date Created**: The date when the ad was first created.
- **Last Seen**: The last time the ad was seen online.
- **Postal Code**: The location of the seller, which might influence the price due to regional demand.


### Steps to Approach the Problem

1. **Data Cleaning**
   - **Check for Missing Values**: Identify any missing data in our dataset and decide how to handle it.
   - **Identify Outliers**: Look for data points that don't make sense, like a car with a price of $0 or a car with an unusually low power value. Outliers can distort the predictions and need to be addressed.
   
2. **Exploratory Data Analysis (EDA)**
   - **Descriptive Statistics**: Get an overview of the data distribution, central tendencies, and spread.
   - **Visualization**: Use plots to understand the relationships between variables. This can help in identifying patterns and insights that are not immediately obvious.
   
3. **Feature Engineering**
   - **Handle Categorical Variables**: Convert categorical variables into numerical format using techniques like one-hot encoding.
   - **Scale Numerical Variables**: Normalize or standardize numerical variables to ensure they're on a comparable scale, which can improve model performance.

4. **Model Building**
   - **Train Multiple Models**: We'll try different regression algorithms (e.g., linear regression, decision trees, random forests) and evaluate their performance.
   - **Hyperparameter Tuning**: Optimize model parameters to improve performance.
   
5. **Model Evaluation**
   - **Evaluate Performance**: Use metrics like Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared to assess how well the model is performing.
   - **Choose the Best Model**: Select the model that provides the best balance of accuracy and generalization.
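
Steps 3 and 4 above can be wired together in a single scikit-learn pipeline. The following is a minimal sketch, assuming scikit-learn is available; the five rows are taken from the `df.head()` preview and are far too few to learn from, so the block only illustrates how one-hot encoding, scaling, and a regressor fit together. The choice of which columns to encode or scale is an assumption made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Five rows from the preview, restricted to a few columns.
cars = pd.DataFrame({
    "gearbox": ["manual", "manual", "manual", "manual", "automatic"],
    "fuelType": ["diesel", "diesel", "diesel", "petrol", "diesel"],
    "powerPS": [150, 163, 101, 86, 185],
    "kilometer": [150000, 150000, 150000, 60000, 150000],
    "price": [4450, 13299, 3200, 4500, 18750],
})
categorical = ["gearbox", "fuelType"]   # assumed split of columns
numeric = ["powerPS", "kilometer"]

# One-hot encode the categorical columns and standardize the numeric ones.
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("scale", StandardScaler(), numeric),
])

# Chain preprocessing and the regressor so both are fitted together.
model = Pipeline([("preprocess", preprocess), ("regress", LinearRegression())])

X, y = cars[categorical + numeric], cars["price"]
model.fit(X, y)

# Inspect the shape of the engineered feature matrix.
features = model.named_steps["preprocess"].transform(X)
print(features.shape)
```

Keeping the preprocessing inside the pipeline means the same transformations are applied at training and prediction time, and a different regressor can be swapped in without touching the feature steps.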

### Importance of Handling Outliers

Identifying and handling outliers is crucial in this case study. Outliers can skew the results and lead to inaccurate predictions. Here are two ways to handle outliers:

1. **Formal Mathematical Methods**: Techniques like the Z-score or the Interquartile Range (IQR) can be used to identify outliers quantitatively.
2. **Common Sense Approach**: Sometimes, common sense is enough to spot outliers. For instance, a car priced at $0 or a car with a power value of 1 HP is clearly an outlier and should be removed.
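
Both approaches can be tried on a small example. The sketch below applies the IQR rule and the Z-score rule to ten illustrative prices (made up for this example, with `0` and `500000` planted as suspicious entries). The 1.5 multiplier is the conventional IQR fence; the Z-score cutoff of 2 is an assumption chosen because the sample is tiny, and 3 is a more common choice on larger data.

```python
import pandas as pd

# Illustrative prices; 0 and 500000 are planted as suspicious entries.
prices = pd.Series([0, 3200, 4450, 4500, 5200, 6100, 7300, 13299, 18750, 500000])

# IQR rule: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = prices[(prices < lower) | (prices > upper)]

# Z-score rule: flag values far from the mean in units of standard deviation.
Z_CUTOFF = 2  # assumed cutoff for this small sample
z = (prices - prices.mean()) / prices.std()
z_outliers = prices[z.abs() > Z_CUTOFF]

print("IQR outliers:", iqr_outliers.tolist())
print("Z-score outliers:", z_outliers.tolist())
```

Both rules flag the 500000 entry, but neither flags the price of 0, because the lower fence falls below zero. This is where the common sense approach earns its place alongside the formal methods.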

### Example: Identifying Outliers

Let's consider the car price as an example. If most car prices are within the range of $1,000 to $50,000, but there are a few entries with a price of $100,000 or $0, those entries could be considered outliers. Similarly, if the car's power typically ranges from 50 HP to 500 HP, any entry with an exceptionally low or high value might be an outlier.


### Combining Categories with Low Frequency

When you have categories with very low frequency, it can be beneficial to combine them with other categories to create a more compact and manageable dataset. This helps in reducing noise and making the model more robust. For example, if you have a car brand that appears only a few times, you might combine it with another brand to streamline the analysis.
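
One common variant of this idea is to pool all rare categories into a single "other" bucket rather than merging them into a specific existing category. The sketch below does that for a made-up list of brands; the `min_count` cutoff and the label `"other"` are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up brand column; "seat" and "lada" each appear only once.
brands = pd.Series([
    "bmw", "volvo", "volkswagen", "seat", "volvo",
    "bmw", "volkswagen", "volkswagen", "lada", "bmw",
])

min_count = 2  # hypothetical cutoff: anything rarer is pooled

# Find the categories that occur fewer than min_count times.
counts = brands.value_counts()
rare = counts[counts < min_count].index

# Keep frequent brands as they are and replace rare ones with "other".
brands_grouped = brands.where(~brands.isin(rare), "other")

print(brands_grouped.value_counts())
```

The same pattern would apply to `model` or `vehicleType`, with the cutoff tuned to the size of the dataset.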

### Filtering Data Based on Logical Checks

It's crucial to filter the dataset to remove any illogical or incorrect entries. For instance:
- **Price**: If there are entries with a price of $0 or an unrealistically high price, these should be removed.
- **Year of Registration**: If the year of registration is beyond 2019 or far before the expected range, these entries should be removed.
- **Power**: Similarly, if the power value of the car is unrealistic, it should be filtered out.
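
These checks translate directly into boolean filters. The sketch below applies them to a small made-up frame using the dataset's column names. Only the 2019 upper limit on the year comes from the text above; every other bound is a hypothetical placeholder that would need to be chosen after looking at the data.

```python
import pandas as pd

# Made-up rows: some are plausible, some break one or more checks.
cars = pd.DataFrame({
    "price": [4450, 0, 13299, 3200, 9999999],
    "yearOfRegistration": [2003, 2005, 2005, 1000, 2010],
    "powerPS": [150, 163, 163, 101, 0],
})

# Hypothetical plausibility bounds (inclusive on both ends).
PRICE_RANGE = (100, 150000)
YEAR_RANGE = (1950, 2019)   # 2019 upper limit as stated in the text
POWER_RANGE = (10, 500)

# A row is kept only if it passes every check.
mask = (
    cars["price"].between(*PRICE_RANGE)
    & cars["yearOfRegistration"].between(*YEAR_RANGE)
    & cars["powerPS"].between(*POWER_RANGE)
)
clean = cars[mask]

print(f"kept {len(clean)} of {len(cars)} rows")
print(clean)
```

Printing how many rows each filter removes is a useful sanity check: if a bound discards a large share of the data, it is probably too strict.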

### Linear Regression and Random Forest

We will explore two different techniques for predicting the price of pre-owned cars:
1. **Linear Regression**: This technique tries to fit a linear model to the data. It's simple and easy to understand and interpret, and its coefficients give a straightforward view of variable importance.
2. **Random Forest**: This is a non-linear technique that often performs better than linear regression on complex datasets. However, it is more complex and harder to interpret.
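
To make the contrast concrete, the sketch below fits both techniques with scikit-learn on synthetic data. The data-generating formula (price falling exponentially with age and rising with power) is invented for illustration and is deliberately non-linear, so it says nothing about how the two models would rank on the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic cars: age in years and power in PS (invented ranges).
rng = np.random.default_rng(0)
n = 500
age = rng.uniform(0, 20, n)
power = rng.uniform(50, 300, n)

# Invented non-linear price: depreciation is exponential in age.
price = 200 * power * np.exp(-0.15 * age) + rng.normal(0, 500, n)

X = np.column_stack([age, power])
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.25, random_state=0
)

# Fit both models on the same training split.
linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# score() returns R-squared on the held-out data.
print("linear R^2:", round(linear.score(X_test, y_test), 3))
print("forest R^2:", round(forest.score(X_test, y_test), 3))

# The linear model exposes one coefficient per feature, which is easy to read.
print("linear coefficients (age, power):", linear.coef_)
```

The linear model's coefficients can be read off directly, while the forest offers only aggregate summaries such as `feature_importances_`. That is the interpretability trade-off described above.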

### Choosing the Right Model

Choosing between linear regression and random forest depends on:
- **Model Complexity**: Linear regression is simpler and more interpretable, while random forest can handle more complexity but is harder to explain.
- **Performance**: If the random forest provides significantly better predictions, it might be worth using despite the complexity.
- **Application**: If the model needs to be embedded in optimization processes, simplicity might be preferred.

### Regression Diagnostics and Assumption Checks

After building a linear regression model, we need to perform regression diagnostics to check if the assumptions of linear regression hold true for our data. These assumptions include:
- **Linearity**: The relationship between the independent and dependent variables should be linear.
- **Homoscedasticity**: The residuals (errors) should have constant variance.
- **Normality**: The residuals should be normally distributed.
- **Independence**: The observations should be independent of each other.
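
These checks are usually done by plotting residuals against fitted values, but a few numeric summaries convey the same idea. The sketch below fits ordinary least squares with NumPy on synthetic data that satisfies the assumptions by construction, then computes rough indicators for linearity, homoscedasticity, and normality. The data, the three-bin split, and the choice of indicators are all assumptions made for illustration; independence depends on how the data was collected and is not tested here.

```python
import numpy as np

# Synthetic data: price is truly linear in kilometer with constant-variance noise.
rng = np.random.default_rng(1)
n = 400
kilometer = rng.uniform(10000, 200000, n)
price = 20000 - 0.08 * kilometer + rng.normal(0, 1000, n)

# Ordinary least squares with an intercept column.
X = np.column_stack([np.ones(n), kilometer])
beta, *_ = np.linalg.lstsq(X, price, rcond=None)
fitted = X @ beta
residuals = price - fitted

# Sort residuals by fitted value so they can be inspected along the fit.
order = np.argsort(fitted)
sorted_residuals = residuals[order]

# Linearity: mean residual in low/mid/high fitted bins should all be near zero.
bin_means = [chunk.mean() for chunk in np.array_split(sorted_residuals, 3)]

# Homoscedasticity: residual spread should be similar for low and high fitted values.
low, high = sorted_residuals[: n // 2], sorted_residuals[n // 2:]
spread_ratio = high.std() / low.std()

# Normality: skewness of the residuals should be close to zero.
skew = np.mean((residuals - residuals.mean()) ** 3) / residuals.std() ** 3

print("bin means:", np.round(bin_means, 1))
print("spread ratio:", round(spread_ratio, 2))
print("skewness:", round(skew, 2))
```

On real data, a clear trend in the bin means would suggest a non-linear relationship, a spread ratio far from 1 would suggest heteroscedasticity, and strong skewness would cast doubt on normality.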


### Performance Metrics

The main performance metric for regression models is the error in prediction. Common metrics include:
- **Mean Absolute Error (MAE)**
- **Mean Squared Error (MSE)**
- **Root Mean Squared Error (RMSE)**
- **R-squared (R²)**
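
The four metrics can be computed by hand on a small example. The actual and predicted prices below are made up for illustration.

```python
import numpy as np

# Made-up actual and predicted prices for four cars.
y_true = np.array([3000.0, 5000.0, 7000.0, 9000.0])
y_pred = np.array([2500.0, 5500.0, 7000.0, 10000.0])

errors = y_true - y_pred

# MAE: average size of the error, in the same units as the price.
mae = np.mean(np.abs(errors))

# MSE: average squared error, which penalizes large misses more heavily.
mse = np.mean(errors ** 2)

# RMSE: square root of MSE, back in the units of the price.
rmse = np.sqrt(mse)

# R-squared: share of the variance in y_true explained by the predictions.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, mse, round(rmse, 2), round(r2, 3))  # 500.0 375000.0 612.37 0.925
```

RMSE (612.37) is larger than MAE (500) because the single 1000-unit miss is weighted more heavily once errors are squared.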

### Improving the Model

If the initial model doesn't perform well, consider:
- **Further Data Processing**: Cleaning or transforming the data.
- **Subsetting Data**: Building separate models for different subsets of data. For instance, creating separate models for low-priced and high-priced cars might yield better results.
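
The subsetting idea can be sketched as follows. The data is synthetic, with price rising slowly with power for cheaper cars and steeply for expensive ones, and `PRICE_SPLIT` is a hypothetical threshold. The segments are assigned from the known price, which works when analyzing past sales; for new listings, a rule based on the car's attributes would be needed to decide which model to use, and that choice is not covered here.

```python
import numpy as np

# Synthetic data with a kink: price rises faster with power above 150 PS.
rng = np.random.default_rng(2)
n = 300
power = rng.uniform(50, 300, n)
price = np.where(power < 150, 40 * power, 6000 + 150 * (power - 150))
price = price + rng.normal(0, 300, n)

def line_residuals(x, y):
    """Fit a straight line y = slope * x + intercept and return its residuals."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# One model for all cars.
single = line_residuals(power, price)

# Separate models for low-priced and high-priced cars.
PRICE_SPLIT = 6000  # hypothetical threshold between the two segments
is_low = price < PRICE_SPLIT
segmented = np.concatenate([
    line_residuals(power[is_low], price[is_low]),
    line_residuals(power[~is_low], price[~is_low]),
])

rmse_single = np.sqrt(np.mean(single ** 2))
rmse_segmented = np.sqrt(np.mean(segmented ** 2))
print("single model RMSE:", round(rmse_single, 1))
print("segmented RMSE:", round(rmse_segmented, 1))
```

The gap between the two errors on this synthetic data comes entirely from the planted kink; on real data the benefit of splitting would have to be verified on held-out sales.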
