# Automodeals

### NOTE: Prior to running this notebook, ensure the following packages are installed and that ipyleaflet's extension for jupyter is enabled (otherwise maps will not display):

In [None]:
import pandas as pd # must be version 1.0.1 or later!
import os
import numpy as np
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
import statsmodels.formula.api as sm
import seaborn as sns

import json
import pickle

from ipyleaflet import Map, basemaps, GeoJSON, Popup, FullScreenControl, CircleMarker, LayerGroup
# if above module isn't installed, do BOTH of the following:
# pip install ipyleaflet
# jupyter nbextension enable --py --sys-prefix ipyleaflet

from ipywidgets import HTML # pip install ipywidgets
from uszipcode import SearchEngine # pip install uszipcode

## Basic Information

Title: KSL AutomoDeals
<br>
Names: Chantel Charlebois, Taylor Hansen, Michael Paskett
<br>
E-mails: chantel.charlebois@utah.edu, taylor.c.hansen@utah.edu, michael.paskett@utah.edu
<br>
UIDS: u1043299, u0642850, u1144000

### Background and Motivation

Almost every person in Utah (and some neighboring states) buying a used car will visit KSL Cars classifieds to look for their new wheels. There are a few resources for understanding the rough value of a used car, such as Kelley Blue Book (https://www.kbb.com) and CarGurus (https://www.cargurus.com/), but such services may not fully incorporate the complex auto market of a local area. By storing and analyzing the prices, details, and options for a certain model or class of vehicle, a prospective buyer can evaluate how good the listed price for a vehicle actually is. With such a model, the user can estimate how much a specific car is really worth, and determine if the vehicle is worthy of a test drive.

### Project Objectives

Questions:
* How well can we predict the price of a newly-listed car based on the attributes available in a KSL advertisement?
* Which attributes are most influential in determining the vehicle price?
* What nearby areas have the best price of cars?

Aims:
* Create a regression model that will predict the expected price of a car based on several attributes, such as:
    * year,  seller type (dealer/private), mileage, color, title (clean/salvaged), transmission type, cylinders, fuel type, number of doors, exterior/interior condition, listing date, page views rate
* Create a clustered map of “good deals” in different regions

Benefits:
* This project could benefit anyone in the market for a used car in the KSL area, helping them to be informed of the potential value of a car they are interested in.

### Ethical Considerations

Stakeholders:
* The creators (us)
* The seller
* The prospective buyer
* KSL

Our incentive as creators and prospective buyers is to find good deals without having to manually spend hours searching through KSL for a good deal. For other prospective buyers, the same applies. The sellers have competing interests, as they would like to sell their car for as much as possible. KSL also has a stake in this project, as it makes revenue from ads and from sellers paying for better listings in order to make their vehicle more prominent.

### Data

We scraped our data from queries of used cars listed on [cars.ksl.com](https://cars.ksl.com/). We read the robots.txt files for both ksl.com and cars.ksl.com to confirm that there were no restrictions for crawling their website. We also reviewed the terms and conditions and similarly found no indication for rules on crawling. There used to be an undocumented API for interfacing with KSL (as of four years ago), but it is no longer publicly accessible, so we manually scraped the data with BeautifulSoup.
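
The robots.txt check described above can also be done programmatically. The following is a minimal sketch using Python's standard `urllib.robotparser`; the rules in `SAMPLE_ROBOTS_TXT`, the example URLs, and the `automodeals-bot` user agent are made up for illustration and are not KSL's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only (not KSL's real file).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def is_crawl_allowed(robots_txt, url, user_agent="automodeals-bot"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

allowed = is_crawl_allowed(SAMPLE_ROBOTS_TXT, "https://cars.ksl.com/listing/123")
blocked = is_crawl_allowed(SAMPLE_ROBOTS_TXT, "https://cars.ksl.com/private/x")
print(allowed, blocked)
```

In a real crawl, the text would come from the site's own robots.txt rather than a hard-coded string.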

To avoid consistently using too much bandwidth on their website, our initial plan was to start with “historical” data collection by saving .html pages over the course of the project so that we could parse them offline. However, we found that KSL listings are dynamically loaded by JavaScript, and the html files were bloated with other associated files which would be too unwieldy to download on a daily basis (on the order of 5,000 cars/day). Instead, we opted for a live crawler approach to get our data.

After an initial crawl through the website to get as many used cars listings as possible, we automated the crawler to go in every night at a randomly scheduled time and pull any newly listed cars not already in our repository. The code associated with this crawler was developed initially in various Jupyter notebooks before porting over to .py files for automation. The code is not included in the present notebook given its length. The crawler/scraping functions are described in more detail in the Project Milestone.
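
The nightly "pull only newly listed cars" step comes down to comparing the links found on the search pages with the links already stored. The crawler code itself is not shown in this notebook, so the following is only a sketch of that idea; the function name and the example URLs are hypothetical.

```python
def find_new_listings(scraped_urls, known_urls):
    """Return scraped listing URLs that are not already stored.

    Order is preserved and repeated URLs within one crawl are dropped.
    """
    seen = set(known_urls)
    new_urls = []
    for url in scraped_urls:
        if url not in seen:
            seen.add(url)
            new_urls.append(url)
    return new_urls

# Hypothetical listing URLs for illustration.
known = ["https://cars.ksl.com/listing/1", "https://cars.ksl.com/listing/2"]
scraped = [
    "https://cars.ksl.com/listing/2",
    "https://cars.ksl.com/listing/3",
    "https://cars.ksl.com/listing/3",
    "https://cars.ksl.com/listing/4",
]
new_listings = find_new_listings(scraped, known)
print(new_listings)
```

Using a set for the stored links keeps each membership check cheap even when the repository holds tens of thousands of listings.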

We also used the CarGurus website (https://www.cargurus.com/Cars/sell-car/?pid=SellMyCarDesktopHeader) and its API to request an expected price for each car based on its make, model, mileage, year, and ZIP code.

#### KSL


##### A typical search page from KSL Cars:
Links for each newly listed individual car were scraped from pages like this.
<br>
<img src='screenshots/KSL_search_page.png'>

<br>
<br>

***

<br>

##### A typical individual listing from KSL cars:
A pandas dataframe was filled with car info scraped from pages like this.
<br>
<img src='screenshots/KSL_listing.png'>

<br>
<br>

***

<br>

##### Crawler Automation:
The crawler was scheduled to run every night and scrape data for any newly listed cars.
<br>
<img src='screenshots/automating_repo_update.png'>

<br>
<br>

***

<br>
<br>

#### Expected Price from CarGurus and Zip Codes
Example of the form used to request the expected car price from CarGurus using their API.
<img src="screenshots/cargurus.png">

All code to get the expected price is commented out because it takes too long to run. The code is also available in the GetExpectedPrice.ipynb notebook.

In [None]:
# import requests
# import pandas as pd
# import os
# import numpy as np
# import bs4
# import time

In [None]:
# # Read in scraped car data
# data_file = os.path.join(os.getcwd(),"data","all_cars.csv") 
# cars = pd.read_csv(data_file)
# pd.set_option("display.max_columns",None) 
# cars['mileage'] = cars['mileage'].astype('Int64')
# cars['year'] = cars['year'].astype('Int64')
# cars.info()

In [None]:
# # Create dictionary of zip codes for all towns to look up expected price
# # Website to look up zipcodes
# # http://localistica.com/usa/ut/salt%20lake%20city/zipcodes/all-zipcodes/

# def get_most_populated_zip_code(city):
#     try:
#         r = requests.get(f'http://localistica.com/search.aspx?q={city.lower().replace(" ", "+")}')
#         url = bs4.BeautifulSoup(r.text).find("a", id="ctl09_hlZipCodesCount")['href']
#         return int(bs4.BeautifulSoup(requests.get(url).text).find(id="dgZipCodes").find_all("tr")[1].td.a.text)
#     except:
#         return None
    

# # get unique cities in dataframe
# all_cities = [car for car in cars.city.unique() if type(car) == str]
# keyList = [x + ", " + cars[cars.city == x].iloc[0]['state'] for x in all_cities]
# # look up zipcode
# zip_codes = {key: get_most_populated_zip_code(key + " " + cars[cars.city == key].iloc[0]['state']) for key in all_cities}
# # hard code the ones it missed
# zip_codes.update({'St. Anthony': 83445, 'Provo Canyon': 84604})
# # check for missing zip codes
# print(len([k for k,v in zip_codes.items() if v == None]))
# zip_codes

In [None]:
# # list of all cars in CarGurus database
# all_cars = requests.get("https://www.cargurus.com/Cars/getCarPickerReferenceDataAJAX.action?showInactive=false&useInventoryService=false&quotableCarsOnly=false&localCountryCarsOnly=true&outputFormat=REACT").json()

# # gets CarGuru make and model id to find price for individual make and model
# def get_cargurus_maker_and_model_ids(all_cars, car_make, car_model):
#     try:
#         all_models = [x for x in all_cars.get('allMakerModels').get('makers') if x.get('name') == car_make][0]
#         return (all_models.get('id'), [x for x in all_models.get('models') if x.get('name') == car_model][0].get('id'))
#     except (IndexError, AttributeError):
#         return (None, None)
    
# # gets the entity id which includes the make, model, and year of car
# def get_entity_id(maker_id, model_id, car_year):
#     try:
#         all_entities = requests.get(f"https://www.cargurus.com/Cars/getSelectedMakerModelCarsAJAX.action?showInactive=false&useInventoryService=false&quotableCarsOnly=false&localCountryCarsOnly=true&outputFormat=REACT&maker={maker_id}").json()
#         model_entity_ids = [car for car in all_entities.get('models') if car.get('id') == model_id][0]
#         return [ids for ids in model_entity_ids.get('cars') if ids.get('year') == car_year][0].get('id')
#     except (IndexError, AttributeError):
#         return None
    
# # gets the estimated listing price of the car based on entity id and the mileage
# def get_price(car_make, car_model, car_year, car_mileage, car_zip_code, all_cars):
#     maker_id, model_id = get_cargurus_maker_and_model_ids(all_cars, car_make, car_model)
#     if not model_id or pd.isna(car_mileage):
#         return None
    
#     entity_id = get_entity_id(maker_id, model_id, car_year)
    
#     if not entity_id:
#         return None 
#     # data needed to request CarGurus report
#     data = {
#         'carDescription.radius': 75,
#         'selectedEntity': entity_id,
#         'carDescription.transmission': "",
#         'carDescription.mileage': car_mileage,
#         'carDescription.postalCode': car_zip_code,
#         'carDescription.engineId': "",
#         'carDescription.vin': "",
#         'carDescription.vinType': "",
#         'forPrivateListing': True,
#         'inventoryListingId' : ""
#     }
    
#     res = requests.post("https://www.cargurus.com/Cars/generateReportJsonAjax.action", data=data)
#     res.raise_for_status()
#     try:
#         return res.json().get("priceDetails").get("privateListingPrice") #private listing price from CarGurus report
#     except AttributeError:
#         raise Exception(res.json())


# for index, row in cars.iterrows():
#     if row["city"] in zip_codes.keys():
#         try:
#             expected_price = get_price(row["make"], row["model"], row["year"], row["mileage"], zip_codes.get(row["city"]), all_cars)
#         except Exception as e:
#             print(e)
#             time.sleep(5)
#             continue
#         # change expected prices that are 0 to none
#         if expected_price == None or expected_price < 1:
#             expected_price = None 
#             print(index)
#         cars.loc[index, "expected_price"] = expected_price
#     else:
#         cars.loc[index, "expected_price"] = None

# # Add zip codes to a column
# for index, row in cars.iterrows():
#     cars.loc[index, "zip_code"] = zip_codes.get(row["city"])
    
# # save cars dataframe to pickle
# cars.to_pickle("./cars.pkl")

# pickle_cars.info() # 36,325 expected prices

### Data Processing

Each KSL listing page has a fairly consistent format, which made scraping feasible for the large number of pages we analyzed. When creating a listing, the user is required to provide the year, VIN, make, model, body style, mileage, title type, asking price, and location, so these are the only features we can guarantee to extract from each page. Many listings include additional details, which we also incorporated into our analysis (as shown in the example listing above).

As mentioned above, data was scraped with BeautifulSoup and structured into a pandas dataframe. Numerical encodings of the categorical variables were generated and added to this dataframe so that they could be used in the analysis. Subsequent processing used built-in pandas boolean masking to select the relevant rows of the dataframe when querying recently listed used cars.
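
The two processing steps above, numerical encoding and boolean masking, can be sketched on a toy dataframe. The column names follow this notebook, but the rows are invented for illustration.

```python
import pandas as pd

# Toy listings; column names follow the notebook, values are invented.
listings = pd.DataFrame({
    "make": ["Ford", "Chevrolet", "Ford"],
    "model": ["F-150", "Silverado 1500", "F-150"],
    "seller": ["Dealer", "Owner", "Owner"],
    "price": [25000.0, 18000.0, 21000.0],
})

# Encode a two-level categorical column as 0/1 (unmapped values become NaN).
listings["seller_num"] = listings["seller"].map({"Dealer": 0, "Owner": 1})

# Boolean mask selecting only the rows that match one query.
mask = (listings["make"] == "Ford") & (listings["model"] == "F-150")
f150_rows = listings[mask]
print(f150_rows[["price", "seller_num"]])
```

Because `map` leaves unmapped categories as NaN, any category missing from the dictionary is dropped later when the regression is fit with `missing='drop'`.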

### Exploratory Analysis

We visualized our data in multiple ways to check our data scraping procedures and make sure we did not incorrectly classify our data. The first basic check we did was scroll through the data frame for any obvious errors using the display command. We then used the describe command to look at the descriptive statistics of each column in our dataframe. Next we visualized the data using a scatterplot matrix in order to check the histograms of each parameter for outliers and general trends. We also used the scatterplot matrix to explore correlations between different parameters. We also visualized a heat map of the correlation matrix to determine which parameters were strongly correlated. This information was used to identify potential strong predictors for the multiple linear regression and determine if any parameters were potential confounders.

Basic outline:
1. Scroll through the dataframe for any obvious errors
2. Describe command to verify the descriptive statistics
3. Visualize using the scatterplot matrix to explore histograms of each parameter and look for outliers/trends
4. Heat map of the correlation matrix to determine which parameters are strongly correlated to identify potential strong predictors for the multiple linear regression or potential confounders

In [None]:
%matplotlib inline
plt.rcParams['figure.figsize'] = (10, 10)
plt.style.use('ggplot')

In [None]:
# Read in scraped car data
data_file = os.path.join(os.getcwd(),"data","all_cars.csv")
cars = pd.read_csv(data_file)
# cars = pd.read_pickle('cars.pkl')
pd.set_option("display.max_rows",None,"display.max_columns",None) 

# Recast data
cars['mileage'] = cars['mileage'].astype('float')
cars['year'] = cars['year'].astype('float')

# Clean price data and recast
# na=False treats missing prices as non-matches so the boolean mask contains no NaN
ugly_cars = cars[cars['price'].str.contains('MSRP', na=False)]
for index, car in ugly_cars.iterrows():
    if '|' not in car['price']:
        cars.at[index,'price'] = None
    else:
        cars.at[index,'price'] = car['price'].split('|')[0].strip()
cars['price'] = cars['price'].astype('float')

#### Descriptive Statistics

In [None]:
display(cars.describe())
cars.info()  # info() prints its report directly and returns None

In [None]:
# Data clean up
# Check year
plt.figure(figsize=(15,5))
plt.subplot(121)
cars["year"].plot.hist(bins=100)
plt.title('Original Year Histogram')
plt.xlabel('Year')
# any cars with a year less than 1920 changed to None
cars.loc[(cars.year < 1920),'year']=None 
plt.subplot(122)
cars["year"].plot.hist(bins=100)
plt.title('Cleaned Year Histogram')
plt.xlabel('Year')

# Check price
plt.figure(figsize=(15,5))
plt.subplot(121)
cars["price"].plot.hist(bins=100)
plt.title('Original Price Histogram')
plt.xlabel('Price ($)')
# any cars with a price less than $100 or greater than $500,000 changed to None
cars.loc[(cars.price < 100),'price']=None 
# expcars = cars['price'] > 500000
# expcars = cars[expcars]
# display(expcars)
cars.loc[(cars.price > 500000),'price']=None 
plt.subplot(122)
cars["price"].plot.hist(bins=100)
plt.title('Cleaned Price Histogram')
plt.xlabel('Price ($)')

# Check Mileage
plt.figure(figsize=(15,5))
plt.subplot(121)
cars["mileage"].plot.hist(bins=100)
plt.title('Original Mileage Histogram')
plt.xlabel('Miles')
# any cars with a mileage greater than 500000 changed to None
cars.loc[(cars.mileage > 500000),'mileage']=None 
plt.subplot(122)
cars["mileage"].plot.hist(bins=100)
plt.title('Cleaned Mileage Histogram')
plt.xlabel('Miles')

# Check Liters
plt.figure(figsize=(15,5))
plt.subplot(121)
cars["liters"].plot.hist(bins=100)
plt.title('Original Liters Histogram')
plt.xlabel('Engine Size (Liters)')
# any cars with a liters greater than 100 changed to None
cars.loc[(cars.liters > 100),'liters']=None 
plt.subplot(122)
cars["liters"].plot.hist(bins=100)
plt.title('Cleaned Liters Histogram')
plt.xlabel('Engine Size (Liters)')

# Check fav_per_view
# any cars with more than 2 favorites per view (including inf) changed to None
cars.loc[(cars.fav_per_view > 2),'fav_per_view']=None 

print('Cleaned Descriptive Statistics')
display(cars.describe())

In [None]:
# Explore Categorical Variables
print(cars['body'].value_counts(), '\n')
print(cars['title_type'].value_counts(), '\n')
print(cars['seller'].value_counts(), '\n')
#print(cars['ext_color'].value_counts(), '\n')
print(cars['transmission'].value_counts(), '\n')
print(cars['fuel_type'].value_counts(), '\n')
print(cars['ext_condition'].value_counts(), '\n')
print(cars['int_condition'].value_counts(), '\n')
print(cars['drive_type'].value_counts(), '\n')

# create numerical categorical variables
print('\n Title types')
# set clean titles to 1 and other titles to 0
cars["title_num"] = cars["title_type"].map({'Rebuilt/Reconstructed Title':0, 'Salvage Title':0, 'Dismantled Title':0, 'Clean Title':1})
print(cars['title_num'].value_counts(), '\n')

print('Seller types')
cars["seller_num"] = cars["seller"].map({'Dealer':0, 'Owner':1})
print(cars['seller_num'].value_counts(), '\n')

print('Transmission types')
cars["transmission_num"] = cars["transmission"].map({'Automatic':0, 'Automanual':0, 'Manual':1, 'CVT':0})
print(cars['transmission_num'].value_counts(), '\n')

print('Drive types: 1 = 4 Wheel Drive, 0 = 2 Wheel Drive')
cars["drive_num"] = cars["drive_type"].map({'4-Wheel Drive':1, '2-Wheel Drive':0, 'FWD':0, 'AWD':1})
print(cars['drive_num'].value_counts(), '\n')

#### Data Cleaning Interpretation
Based on the histograms and descriptive statistics, we cleaned the data by treating outliers as missing values. We changed a value to missing if the car year was earlier than 1920, the price was less than 100 dollars or greater than 500,000 dollars, the mileage was greater than 500,000 miles, the engine size was greater than 100 liters, or the favorites per view was greater than 2. This removed most of the outliers, and the describe() values look much more realistic. We then converted the categorical variables title_type, seller, transmission, and drive_type to numerical variables with values of 0 or 1 so that we could include them in our regression analysis.
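
The threshold rules above could also be collected into one table and applied in a loop. This is a sketch of that alternative rather than the code the notebook runs; `LIMITS` and `null_out_of_range` are hypothetical names, and the toy rows are invented.

```python
import pandas as pd

# Plausible ranges taken from the cleaning rules described above (None = no bound).
LIMITS = {
    "year": (1920, None),
    "price": (100, 500_000),
    "mileage": (None, 500_000),
    "liters": (None, 100),
}

def null_out_of_range(df, limits):
    """Return a copy of df with values outside each column's (low, high) range set to NaN."""
    cleaned = df.copy()
    for column, (low, high) in limits.items():
        if column not in cleaned.columns:
            continue  # skip rules for columns this dataframe does not have
        values = cleaned[column]
        out_of_range = pd.Series(False, index=values.index)
        if low is not None:
            out_of_range |= values < low
        if high is not None:
            out_of_range |= values > high
        cleaned[column] = values.mask(out_of_range)
    return cleaned

# Invented rows: the first is implausible in every column, the second is ordinary.
toy = pd.DataFrame({
    "year": [1900.0, 2015.0],
    "price": [50.0, 20000.0],
    "mileage": [600000.0, 45000.0],
})
cleaned = null_out_of_range(toy, LIMITS)
print(cleaned)
```

Setting the offending value to missing, rather than dropping the row, keeps the listing available for analyses that do not use that column.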

#### Scatter Matrix & Correlation Matrix for All Cars

In [None]:
# Check histograms and scatter matrix for outliers
# Scatter Matrix
scatter_matrix(cars[["price", "year", "mileage", "liters", "cylinders", "n_doors", "n_pics",'views','favorites','view_rate','favorite_rate']], alpha = 0.2, figsize=(15, 10))
print()

# Correlation Matrix
# automatically ignores missing values
fig = plt.figure(figsize=(15,10));
ax = fig.add_subplot(111)
plt.pcolor(cars.corr(),vmin=-1,vmax=1, cmap='BrBG')  # colormap by name; plt.cm.get_cmap is removed in recent matplotlib
labels = ['lastpull_ts','price','year','mileage','liters','cylinders','n_doors','n_pics','expected_price','zip_code','views','favorites','workingURL','view_rate','favorite_rate','fav_per_view','title_num', 'seller_num', 'transmission_num', 'drive_num'] 
plt.xticks([i+0.5 for i in range(len(labels))],labels=labels,rotation=90)
plt.yticks([i+0.5 for i in range(len(labels))],labels=labels)
plt.title("Correlation Matrix");
plt.colorbar();
display(cars.corr())

#### Data Exploration Interpretation for All Cars
The scatter matrix and correlation matrix for all cars show that the strongest correlations with price are mileage (about -0.56), year (about 0.47), and drive_num (about 0.42), all in the direction we would expect. Several predictors are also strongly correlated with each other and could act as confounders: cylinders x liters, views x favorites, views x view_rate, views x favorite_rate, and favorites x favorite_rate. These correlations make sense because the variables are so closely related.
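
For reference, each entry in the correlation matrix is a Pearson correlation computed over the rows where both variables are present. The sketch below computes it by hand on invented numbers; `pearson_r` is a hypothetical helper, not part of pandas.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences, skipping pairs with a missing (None) value."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)

# Invented mileage/price pairs; the third car has no recorded mileage.
mileage = [10_000, 50_000, None, 120_000]
price = [30_000, 22_000, 18_000, 9_000]
r_mileage_price = pearson_r(mileage, price)
print(round(r_mileage_price, 3))
```

A value near -1 means price falls almost linearly as mileage rises; the observed value of about -0.56 indicates a clear but much noisier relationship.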

### Analysis Methodology

#### Regression
Aim: predict the price of a newly-listed used car
<br>
Dependent variable: list price
<br>
Possible independent variables: year, seller type (dealer/private), mileage, color, title (clean/salvaged), transmission type, cylinders, fuel type, number of doors, exterior/interior condition, listing date, page views per day. 
<br>
We will use the Python package statsmodels to perform all regression analyses. 
1. Multiple linear regression first using the parameters that had strong correlations with list price. 
2. Based on this initial model, we will adjust our multiple linear regression to only include parameters that have significant p-values for their individual coefficients.
Significance level: 𝝰=0.05

Expected Outcomes:
* Our final model should have a p-value < 0.05 for the F-statistic of the overall model. 
* We are aiming to explain at least 70% of the variance with our model and hope to get an R-squared value of 0.70 or more.
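
To make the R-squared target concrete, the sketch below fits a multiple linear regression by least squares with NumPy and computes R-squared as one minus the ratio of residual to total sum of squares. The notebook itself uses statsmodels; `fit_ols` is a hypothetical helper, and the data are synthetic, generated only for illustration.

```python
import numpy as np

def fit_ols(X, y):
    """Fit y ~ intercept + X by least squares; return (coefficients, r_squared)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])  # prepend the intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = float(residuals @ residuals)          # unexplained variation
    ss_tot = float(((y - y.mean()) ** 2).sum())    # total variation around the mean
    return coef, 1.0 - ss_res / ss_tot

# Synthetic data for illustration only (not KSL listings).
rng = np.random.default_rng(0)
n = 200
year = rng.integers(2005, 2021, size=n)
mileage = rng.uniform(0, 200_000, size=n)
price = 1000 * (year - 2000) - 0.05 * mileage + 10_000 + rng.normal(0, 1000, size=n)

coef, r_squared = fit_ols(np.column_stack([year, mileage]), price)
print("coefficients (intercept, year, mileage):", coef)
print("R-squared:", round(r_squared, 3))
print("meets 0.70 target:", r_squared >= 0.70)
```

This sketch does not compute p-values; the coefficient and F-statistic p-values used for model selection come from the statsmodels summary.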

In [None]:
# multiple linear regression
# set missing option to drop any missing data
print('Multiple linear regression for all variables across all cars')
df_all_ols = sm.ols(formula="price ~ year + mileage + liters + cylinders + n_doors + n_pics + views + favorites + view_rate + favorite_rate + fav_per_view", data=cars, missing='drop').fit()
display(df_all_ols.summary())

# remove correlated variables
# correlated variables - views x favorites, views x view_rate, favorites x favorite_rate, views x favorite_rate, view_rate x favorite_rate
print('Multiple linear regression with highly correlated variables removed (favorites and views and their rates)')
df_all_ols = sm.ols(formula="price ~ year + mileage + liters + cylinders + n_doors + n_pics + favorites + fav_per_view", data=cars, missing='drop').fit()
display(df_all_ols.summary())

# remove insignificant variables
print('Multiple linear regression with significant predictor variables')
df_all_ols = sm.ols(formula="price ~ year + mileage + liters + cylinders + fav_per_view", data=cars, missing='drop').fit()
display(df_all_ols.summary())

# remove insignificant variables
print('Multiple linear regression with significant predictor variables round 2')
df_all_ols = sm.ols(formula="price ~ year + mileage + liters + fav_per_view", data=cars, missing='drop').fit()
display(df_all_ols.summary())

# Compare expected price from cargurus to actual list price
print('Simple linear regression of expected price')
df_ols = sm.ols(formula="price ~ expected_price", data=cars, missing='drop').fit()
display(df_ols.summary())

plt.figure(figsize=(15,10))
sns.regplot(x='expected_price', y='price', data=cars)
plt.title('All Cars Simple Linear Regression')
plt.show()

In [None]:
# Find the most common car make
cars['make_model'] = cars['make'] + ' ' + cars['model']

print('Most common models of cars:')
print(cars['make_model'].value_counts()[:5].sort_values(ascending=False))
# the 1500 model includes Ram, GMC so second most common car is Silverado 1500

# Find all Ford F150, most popular car
mask = cars['make_model'] == 'Ford F-150'
F150 = cars[mask]

In [None]:
# Ford F150
# Scatter Matrix
scatter_matrix(F150[["price", "year", "mileage", "liters", "cylinders", "n_doors", "n_pics",'views','favorites','view_rate','favorite_rate']], alpha = 0.2, figsize=(15, 10))
print()

# Correlation Matrix
# automatically ignores missing values
fig = plt.figure(figsize=(15,10));
ax = fig.add_subplot(111)
plt.pcolor(F150.corr(),vmin=-1,vmax=1, cmap='BrBG')  # colormap by name; plt.cm.get_cmap is removed in recent matplotlib
labels = ['lastpull_ts','price','year','mileage','liters','cylinders','n_doors','n_pics','expected_price','zip_code','views','favorites','workingURL','view_rate','favorite_rate','fav_per_view','title_num', 'seller_num', 'transmission_num', 'drive_num'] # check labels for final df
plt.xticks([i+0.5 for i in range(len(labels))],labels=labels,rotation=90)
plt.yticks([i+0.5 for i in range(len(labels))],labels=labels)
plt.title("Correlation Matrix");
plt.colorbar();
display(F150.corr())

In [None]:
# Ford F150 Multiple Linear Regression
print('Multiple linear regression Ford F150')
df_ols = sm.ols(formula="price ~ year + mileage + liters + cylinders + n_doors + n_pics + views + favorites + title_num + seller_num + transmission_num + drive_num", data=F150, missing='drop').fit()
display(df_ols.summary())
print('We explained 85% of the variance with this initial model including all variables \n \n')

print('Multiple linear regression Ford F150 remove insignificant variables')
df_ols = sm.ols(formula="price ~ year + mileage + liters + cylinders + n_pics + views + favorites + drive_num", data=F150, missing='drop').fit()
display(df_ols.summary())
print('We explained 76% of the variance with our second model that does not include variables that were not significant in the previous model \n \n')

print('Simple linear regression of expected price Ford F150')
df_ols = sm.ols(formula="price ~ expected_price", data=F150, missing='drop').fit()
display(df_ols.summary())
print('70% of the variance is explained when comparing the expected price to the list price. The model we developed above explains more variance than the CarGurus expected price. \n \n')



In [None]:
# Second most common type of car

# Find all Chevy Silverado 1500, second most popular car
mask = cars['make_model'] == 'Chevrolet Silverado 1500'
silverado_1500 = cars[mask]

# Chevy Silverado 1500
# Scatter Matrix
scatter_matrix(silverado_1500[["price", "year", "mileage", "liters", "cylinders", "n_doors", "n_pics",'views','favorites','view_rate','favorite_rate']], alpha = 0.2, figsize=(15, 10))
print()

# Correlation Matrix
# automatically ignores missing values
fig = plt.figure(figsize=(15,10));
ax = fig.add_subplot(111)
plt.pcolor(silverado_1500.corr(),vmin=-1,vmax=1, cmap='BrBG')  # colormap by name; plt.cm.get_cmap is removed in recent matplotlib
labels = ['lastpull_ts','price','year','mileage','liters','cylinders','n_doors','n_pics','expected_price','zip_code','views','favorites','workingURL','view_rate','favorite_rate','fav_per_view','title_num', 'seller_num', 'transmission_num', 'drive_num'] # check labels for final df
plt.xticks([i+0.5 for i in range(len(labels))],labels=labels,rotation=90)
plt.yticks([i+0.5 for i in range(len(labels))],labels=labels)
plt.title("Correlation Matrix");
plt.colorbar();
print('Correlations for the Chevy Silverado 1500')
display(silverado_1500.corr())

In [None]:
# Silverado 1500 Multiple Linear Regression
print('Multiple linear regression Chevy Silverado 1500')
df_ols = sm.ols(formula="price ~ year + mileage + liters + cylinders + n_doors + n_pics + views + favorites + title_num + seller_num + transmission_num + drive_num", data=silverado_1500, missing='drop').fit()
display(df_ols.summary())
print(' We explained 94% of the variance when including all variables in this model to predict the list price of Chevy Silverado 1500. \n \n')

print('Multiple linear regression Chevy Silverado 1500 remove insignificant variables')
df_ols = sm.ols(formula="price ~ year + mileage + liters + drive_num", data=silverado_1500, missing='drop').fit()
display(df_ols.summary())
print(' Our model explains 89% of the variance after removing the insignificant variables. \n \n')

print('Simple linear regression of expected price Chevy Silverado 1500')
df_ols = sm.ols(formula="price ~ expected_price", data=silverado_1500, missing='drop').fit()
display(df_ols.summary())
print('77% of the variance is explained when comparing the expected price to the list price. The model we developed above explains more variance than the CarGurus expected price. \n \n')



#### Regression Interpretation
For the most popular car, the Ford F-150, we were able to make a multiple linear regression model that explained 76% of the variance in list price using the independent variables of year, mileage, liters, cylinders, n_pics, views, favorites, and drive (4-wheel drive vs 2-wheel drive). This explained more of the variance than just comparing the list price to the CarGurus expected price.

For the second most popular car, the Chevy Silverado 1500, our multiple linear regression model explained 89% of the variance in list price using the variables year, mileage, liters, and drive. It is interesting that this model explained more of the variance with fewer variables. When we included all of the variables we were able to explain 94% of the variance, but this model is probably overfit.

These regression models for individual cars explain a lot of the variance in this dataset and could be used to predict the expected price of cars on KSL based on their year, mileage, liters, cylinders, and drive. These expected list prices could be compared to the actual list price to alert potential buyers to good or bad deals, instead of relying on a website like CarGurus.
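
One way such an alert could work is to compare each listing's price with the model's prediction and flag listings that fall outside a tolerance band. The sketch below is only an illustration: `classify_deal` and the 10% tolerance are assumptions, not something this notebook implements.

```python
def classify_deal(list_price, predicted_price, tolerance=0.10):
    """Label a listing by how its list price compares with a model's predicted price.

    tolerance is the fraction above or below the prediction still considered fair
    (10% here is an arbitrary illustrative choice).
    """
    if predicted_price <= 0:
        raise ValueError("predicted_price must be positive")
    ratio = list_price / predicted_price
    if ratio < 1 - tolerance:
        return "good deal"
    if ratio > 1 + tolerance:
        return "bad deal"
    return "fair"

# Invented prices for illustration.
print(classify_deal(17_000, 20_000))
print(classify_deal(20_500, 20_000))
print(classify_deal(25_000, 20_000))
```

A fixed tolerance is the simplest choice; the width of the band could instead be tied to the spread of the regression residuals for that model of car.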

#### Clustering
We plan to group what we classify as a “good deal” by geographical location to show areas in Utah where cars are generally sold for a good price. We will create a map that compares the list price with the CarGurus expected price to find geographical locations of good deals.

In [None]:
search = SearchEngine()

In [None]:
# Add in a city, state column
cars['citystate_abb'] = cars[['city', 'state']].apply(lambda x: ', '.join(x.astype(str)), axis=1)
# Combine make and model for new column
cars['make_model'] = cars[['make', 'model']].apply(lambda x: ' '.join(x.astype(str)), axis=1)

# only get non-null expected price rows
good_cars = cars[cars.expected_price.notnull()]

In [None]:
# load zip code coordinates
with open(os.path.join(os.getcwd(),'maps','zip_coord_api.pkl'),'rb') as handle:
    zip_coord = pickle.load(handle)

In [None]:
# reverse the ZIP:lat/long dictionary
coor_zip = {str(round(v[0], 3))+str(round(v[1], 3)): k for k, v in zip_coord.items()}

if len(coor_zip) != len(zip_coord):
    raise ValueError('the reversed ZIP code dictionary had a different number of keys than the original')

In [None]:
# group the data by ZIP code and generate some summary statistics for the map
good_cars = good_cars.copy()
good_cars['price_diff'] =  good_cars['price'] / good_cars['expected_price']
good_cars['price_cat'] = pd.qcut(good_cars['price_diff'], 3, labels=["green", "orange", "red"])
good_cars['price_abs_cat'] = pd.qcut(good_cars['price'], 3, labels=["green", "orange", "red"])
good_cars['price_label'] = pd.qcut(good_cars['price_diff'], 3, labels=["Good", "Average", "Bad"])
good_cars['price_abs_label'] = pd.qcut(good_cars['price'], 3, labels=["Cheap", "Average", "Expensive"])

good_cars['zip_code'] = good_cars['zip_code'].astype(int)
zip_group = good_cars.groupby('zip_code')

zip_group_mode = zip_group.agg(pd.Series.mode)
zip_group_median = zip_group.agg(pd.Series.median)
zip_group_count = zip_group.agg('count').iloc[:,0]
zip_group_count_log = np.floor(np.log(zip_group_count)+2)

##### Generate an interactive map that categorizes how good city prices are based on $\frac{KSL\,listed\,price}{CarGurus\,expected\,price}$

* Color encodes whether cars in that area are generally a good deal
    * Red = bad
    * Orange = average
    * Green = good
* Circle size encodes how many cars are listed for a given location
* Click on each colored circle to display summary statistics for a given city
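
The circle size uses a logarithmic scale, floor(ln(count) + 2), so that locations with thousands of listings do not swamp the map. The sketch below reproduces that scaling in plain Python; `marker_radius` is a hypothetical helper name.

```python
import math

def marker_radius(n_listings):
    """Circle radius in pixels: floor(ln(n) + 2), so size grows slowly with listing count."""
    if n_listings < 1:
        raise ValueError("n_listings must be at least 1")
    return math.floor(math.log(n_listings) + 2)

for n in (1, 10, 100, 1000):
    print(n, marker_radius(n))
# prints:
# 1 2
# 10 4
# 100 6
# 1000 8
```

Each tenfold increase in listings adds roughly two pixels of radius, and a location with a single listing still gets a visible marker.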

In [None]:
# Generate an interactive map that categorizes how good city prices are based on the ratio of KSL listed price over CarGurus expected price

utah_center = [39.3210, -111.0937]
zoom = 6
m = Map(basemap=basemaps.OpenStreetMap.Mapnik, center=utah_center, zoom=zoom)
m.add_control(FullScreenControl())

def click_disp(event, type, coordinates):
    # use coordinates to reverse look up data associated with those coordinates
    try:
        currzip = coor_zip[str(round(coordinates[0], 3))+str(round(coordinates[1], 3))]
    except KeyError:
        # fall back to the ZIP code nearest to the clicked location
        currzip = int(search.by_coordinates(coordinates[0], coordinates[1], radius=30, returns=1)[0].zipcode)
    currdata = good_cars.loc[good_cars['zip_code'] == currzip]
    currlabel = zip_group_mode.loc[currzip, 'citystate_abb']
    currtotal = "{:,}".format(zip_group_count.loc[currzip])
    currmedprice = "{:,}".format(int(zip_group_median.loc[currzip, 'price']))
    currmedmile = "{:,}".format(int(zip_group_median.loc[currzip, 'mileage']))
    currmedyear = str(int(zip_group_median.loc[currzip, 'year']))
    if isinstance(zip_group_mode.loc[currzip, 'make_model'], str):
        currcommcar = zip_group_mode.loc[currzip, 'make_model']
        mid_msg = '<tr><td>Most Common Car:&emsp;</td><td>' + currcommcar + '</td></tr>'
    else:
        currcommcar = zip_group_mode.loc[currzip, 'make_model'][0]
        mid_msg = '<tr><td>Most Common Car:&emsp;</td><td>' + currcommcar + '</td></tr>'
        for currcommcar in zip_group_mode.loc[currzip, 'make_model'][1:]:
            mid_msg += '<tr><td>&emsp;</td><td>' + currcommcar + '</td></tr>'
    if isinstance(zip_group_mode.loc[currzip, 'price_label'], str):
        currdeal = zip_group_mode.loc[currzip, 'price_label']
    else:
        currdeal = zip_group_mode.loc[currzip, 'price_label'][0]
    
    # remove old popup layer
    if isinstance(m.layers[-1], Popup):
        m.remove_layer(m.layers[-1])

    # add a popup layer when a city's circle is clicked
    message = HTML()
    
    upper_msg = ('<h4><strong>' + currlabel + '</strong></h4>' +
                     '<table>' +
                         '<tr><td>Total Cars:&emsp;</td><td>' + currtotal + '</td></tr>' +
                         '<tr><td>Median Price:&emsp;</td><td>$' + currmedprice + '</td></tr>' +
                         '<tr><td>Median Mileage:&emsp;</td><td>' + currmedmile + '</td></tr>' +
                         '<tr><td>Median Year:&emsp;</td><td>' + currmedyear + '</td></tr>')
    lower_msg = ('<tr><td>Overall Market:&emsp;</td><td>' + currdeal + '</td></tr>' +
                     '</table>')

    message.value = upper_msg + mid_msg + lower_msg
    
    popup = Popup(location=coordinates, child=message, close_button=False, auto_close=True, close_on_escape_key=False)

    m.add_layer(popup) # add the new layer

# create a layer group 
layer_group = LayerGroup()
for zipp, coord in zip_coord.items():
    circle = CircleMarker()
    circle.location = coord
    circle.radius = int(zip_group_count_log[zipp])
    circle.weight = 2
    circle.opacity = 0.8
    if isinstance(zip_group_mode.loc[zipp,'price_cat'], str):
        color = zip_group_mode.loc[zipp,'price_cat']
    else:
        color = zip_group_mode.loc[zipp,'price_cat'][0]
    circle.color = color
    circle.fill_color = color
    circle.fill_opacity = 0.3
    layer_group.add_layer(circle)
    circle.on_click(click_disp)

    

m.add_layer(layer_group)
m

##### Generate a different interactive map that categorizes how good city prices are based on all KSL listed prices

* Color encodes how cars in that area generally compare to others on KSL
    * Red = expensive
    * Orange = average
    * Green = cheap
* Circle size encodes how many cars are listed for a given location
* Click on each colored circle to display summary statistics for a given city
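
The click lookup can be sketched as follows. The click handler receives the marker's coordinates, rounds each to three decimals, and concatenates them into a string key for `coor_zip`. How `coor_zip` is actually built is not shown here, so the `make_key` helper and the sample coordinates below are assumptions for illustration only.

```python
def make_key(lat, lon, ndigits=3):
    """Build a string key from rounded coordinates (hypothetical helper)."""
    return str(round(lat, ndigits)) + str(round(lon, ndigits))

# Hypothetical zip -> [lat, lon] mapping (illustrative values only)
example_zip_coord = {84101: [40.7559, -111.8967], 84321: [41.7354, -111.8344]}

# Reverse lookup: rounded-coordinate key -> zip code
example_coor_zip = {make_key(lat, lon): zipp
                    for zipp, (lat, lon) in example_zip_coord.items()}

clicked = [40.7559, -111.8967]  # coordinates as passed to the click callback
print(example_coor_zip[make_key(*clicked)])
```

Rounding makes the key tolerant of small floating-point differences between the stored and clicked coordinates. The trade-off is that two locations within about 0.001 degrees of each other would share a key.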

In [None]:
# Generate a different interactive map that categorizes how good city prices are based on all KSL listed prices

utah_center = [39.3210, -111.0937]
zoom = 6
m2 = Map(basemap=basemaps.OpenStreetMap.Mapnik, center=utah_center, zoom=zoom)
m2.add_control(FullScreenControl())

def click_disp(event, type, coordinates):
    # use coordinates to reverse look up data associated with those coordinates
    try:
        currzip = coor_zip[str(round(coordinates[0], 3))+str(round(coordinates[1], 3))]
    except KeyError:
        # no marker matches the rounded coordinates; fall back to the zip code nearest a fixed reference point
        currzip = int(search.by_coordinates(40.4, -112, radius=30, returns=1)[0].zipcode)
    currlabel = zip_group_mode.loc[currzip, 'citystate_abb']
    currtotal = "{:,}".format(zip_group_count.loc[currzip])
    currmedprice = "{:,}".format(int(zip_group_median.loc[currzip, 'price']))
    currmedmile = "{:,}".format(int(zip_group_median.loc[currzip, 'mileage']))
    currmedyear = str(int(zip_group_median.loc[currzip, 'year']))
    if isinstance(zip_group_mode.loc[currzip, 'make_model'], str):
        currcommcar = zip_group_mode.loc[currzip, 'make_model']
        mid_msg = '<tr><td>Most Common Car:&emsp;</td><td>' + currcommcar + '</td></tr>'
    else:
        currcommcar = zip_group_mode.loc[currzip, 'make_model'][0]
        mid_msg = '<tr><td>Most Common Car:&emsp;</td><td>' + currcommcar + '</td></tr>'
        for currcommcar in zip_group_mode.loc[currzip, 'make_model'][1:]:
            mid_msg += '<tr><td>&emsp;</td><td>' + currcommcar + '</td></tr>'
    if isinstance(zip_group_mode.loc[currzip, 'price_abs_label'], str):
        currdeal = zip_group_mode.loc[currzip, 'price_abs_label']
    else:
        currdeal = zip_group_mode.loc[currzip, 'price_abs_label'][0]
        
    # remove old popup layer
    if isinstance(m2.layers[-1], Popup):
        m2.remove_layer(m2.layers[-1])

    # add a popup layer when a city's circle is clicked
    message = HTML()
    
    upper_msg = ('<h4><strong>' + currlabel + '</strong></h4>' +
                     '<table>' +
                         '<tr><td>Total Cars:&emsp;</td><td>' + currtotal + '</td></tr>' +
                         '<tr><td>Median Price:&emsp;</td><td>$' + currmedprice + '</td></tr>' +
                         '<tr><td>Median Mileage:&emsp;</td><td>' + currmedmile + '</td></tr>' +
                         '<tr><td>Median Year:&emsp;</td><td>' + currmedyear + '</td></tr>')
    lower_msg = ('<tr><td>Overall Market:&emsp;</td><td>' + currdeal + '</td></tr>' +
                     '</table>')

    message.value = upper_msg + mid_msg + lower_msg
    
    popup = Popup(location=coordinates, child=message, close_button=False, auto_close=True, close_on_escape_key=False)

    m2.add_layer(popup) # add the new layer

# create a layer group 
layer_group = LayerGroup()
for zipp, coord in zip_coord.items():
    circle = CircleMarker()
    circle.location = coord
    circle.radius = int(zip_group_count_log[zipp])
    circle.weight = 2
    circle.opacity = 0.8
    if isinstance(zip_group_mode.loc[zipp,'price_abs_cat'], str):
        color = zip_group_mode.loc[zipp,'price_abs_cat']
    else:
        color = zip_group_mode.loc[zipp,'price_abs_cat'][0]
    circle.color = color
    circle.fill_color = color
    circle.fill_opacity = 0.3
    layer_group.add_layer(circle)
    circle.on_click(click_disp)

    

m2.add_layer(layer_group)
m2

### Project Schedule

#### February 24th - 28th
* Check data accessibility (robots.txt and terms and conditions)
* Basic info due Wed Feb 26th
* Project Proposal due Fri Feb 28th
#### March 2nd - 6th
* Download HTML files for all recent listings from KSL
* Begin data scraping and create one dataframe with each row as a listing
* Get/give peer feedback March 5th
* Written feedback from staff by March 8th
#### March 9th - 13th (Spring Break)
* Finish data scraping 
* Exploratory analysis
* Describe 

#### March 16th - 20th
* Exploratory analysis
    * Scatter Matrix
        * Interpret histograms - check if there are any outliers that could be an error from scraping
        * Interpret correlations
    * Heatmap of Correlation Matrix
        * Interpret Correlations

#### March 23rd - 27th
* Write up project milestone
* Project milestone due March 29th 
* Milestone deliverables: acquired and cleaned data, EDA, sketches of our analysis methods, and a zip file with the Jupyter Notebook, data, and other resources

#### March 30th - April 3rd 
* Get staff feedback
* Begin testbed for good deal predictions based on relation to scraped historical dataset

#### April 6th - April 10th
* Finalize predictive model for new listings
* Script and film project video

#### April 13th - 17th
* Polish up repository in preparation for final submission
* Edit and finalize project video
* Project Due Sunday April 19th
* Project awards April 21st



### Peer Feedback
Our Reviewers: Kyle Cornwall, Shushanna Mkrtychyan

* This is pretty similar to cargurus.com and KBB. How is this different from those existing sites?

* How do you know if a car has been in an accident?

* Look for granularity of NADAguides and devise ways that we can "beat" that model.

* Consider doing feature transformation when doing regression.

* Can you enhance the dataset with some other website?

* What features do other car valuation websites use to generate their price predictions?

* Can you get Carfax info from VIN? (without breaking the bank)

* Three potential classes when predicting a value (good, average, bad)

* Might need to downselect the number of cars we can predict prices for since our dataset size could be limited (i.e. top 20 most frequent cars)

### Video

Add link to final video