# Automodeals

### Basic Information

Title: KSL AutomoDeals
<br>
Names: Chantel Charlebois, Taylor Hansen, Michael Paskett
<br>
E-mails: chantel.charlebois@utah.edu, taylor.c.hansen@utah.edu, michael.paskett@utah.edu
<br>
UIDs: u1043299, u0642850, u1144000

### Background and Motivation

Almost every person in Utah (and some neighboring states) buying a used car will visit KSL Cars classifieds to look for their new wheels. There are a few resources for understanding the rough value of a used car, such as Kelley Blue Book (kbb.com), but such services cannot fully capture the complex auto market of a local area. By storing and analyzing the prices, details, and options for a certain model or class of vehicle, a prospective buyer can evaluate how good the listed price for a vehicle actually is. With such a model, the user can estimate how much a specific car is really worth and determine whether the vehicle is worthy of a test drive.

### Project Objectives

Questions:
* How well can we predict the price of a newly-listed car based on the attributes available in an advertisement?
* Which attributes are most influential in determining the vehicle price?
* Which areas have the best car prices?

Aims:
* Create a regression model that will predict the expected price of a car based on several attributes, such as:
    * year, seller type (dealer, private), mileage, color, title (clean/salvaged), transmission type, cylinders, fuel type, number of doors, exterior/interior condition, listing date, page views per day (when a listing has reached 7 days)
* Create a clustered map of “good deals” in different regions

Benefits:
* This project could benefit anyone in the market for a used car, helping them to be informed of the potential value of a car they are interested in.

### Data

We scraped our data from queries of used cars listed on [cars.ksl.com](https://cars.ksl.com/). We read the robots.txt files for both ksl.com and cars.ksl.com to confirm that there are no restrictions on crawling their website. We also reviewed the terms and conditions and similarly found no rules against crawling. There used to be an undocumented API for interfacing with KSL (as of four years ago), but it is no longer publicly accessible, so we scrape the pages ourselves with BeautifulSoup.
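A robots.txt check like the one described above can also be automated with Python's standard `urllib.robotparser`. The sketch below parses an inline, made-up rule set (it is not KSL's actual robots.txt), and the user-agent name and URLs are placeholders chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only; the real file
# would be fetched from the site being crawled.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser without network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "automodeals-bot" and the example.com URLs are placeholder values.
listing_allowed = parser.can_fetch("automodeals-bot", "https://example.com/listing/123")
private_allowed = parser.can_fetch("automodeals-bot", "https://example.com/private/page")
delay_seconds = parser.crawl_delay("automodeals-bot")

print(listing_allowed, private_allowed, delay_seconds)
```

A crawler could call `can_fetch` before each request and sleep for the reported crawl delay between requests.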

To avoid consistently using too much bandwidth on their website, we started with "historical" data collection, saving .html pages over the course of the project so that we could parse them offline. However, because the information we need is loaded dynamically by JavaScript, the saved HTML files were bloated with associated files and would have been too unwieldy to download on a daily basis (on the order of 5,000 cars/day). Instead, we opted for a live crawler approach to get our data.

We used the CarGurus website (https://www.cargurus.com/Cars/sell-car/?pid=SellMyCarDesktopHeader) and their API to request an expected car price based on each car's make, model, mileage, year, and ZIP code.

### Ethical Considerations

Stakeholders:
* The creators (us)
* The seller
* The prospective buyer
* KSL

Our incentive as creators and prospective buyers is to find good deals without having to manually spend hours searching through KSL for a good deal. For other prospective buyers, the same applies. The sellers have competing interests, as they would like to sell their car for as much as possible. KSL also has a stake in this project, as it makes revenue from ads and from sellers paying for better listings in order to make their vehicle more prominent.
We anticipate that other ethical considerations may arise as the project progresses and details are worked out.

### Data Processing

Each listing page has a fairly consistent format, which makes scraping feasible for the large number of pages we will be analyzing. The quantities we plan to derive from our data are listed above in the Project Objectives section. When creating a listing, the user is required to provide the year, VIN, make, model, body style, mileage, title type, asking price, and ZIP code. A timestamp is also associated with each listing. Together, these are the only features we can guarantee to extract from each page. Many listings include additional details beyond these, which we also plan to use.

As mentioned above, data will be scraped with BeautifulSoup and structured into a pandas dataframe. Dummy variables for categorical variables may be generated and concatenated to this dataframe to facilitate use of these variables. Subsequent processing will use pandas boolean masking to select the relevant rows from the dataframe for new queries when searching recently listed used cars.
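As a rough illustration of the dummy-variable and masking steps, the sketch below uses a small made-up dataframe. The column names and values are assumptions for the example, not the scraped schema.

```python
import pandas as pd

# Made-up listings; column names are assumptions, not the scraped schema.
cars = pd.DataFrame({
    "make": ["Honda", "Toyota", "Honda", "Ford"],
    "year": [2012, 2015, 2018, 2010],
    "price": [8500, 14000, 19500, 6000],
    "title": ["clean", "clean", "salvaged", "clean"],
    "transmission": ["automatic", "manual", "automatic", "automatic"],
})

# One 0/1 indicator column per category level, concatenated to the original frame.
dummies = pd.get_dummies(cars[["title", "transmission"]], dtype=int)
cars_encoded = pd.concat([cars, dummies], axis=1)

# Boolean mask selecting the rows relevant to a new query
# (here: Hondas with a clean title).
mask = (cars_encoded["make"] == "Honda") & (cars_encoded["title_clean"] == 1)
matches = cars_encoded[mask]

print(matches[["make", "year", "price"]])
```

Keeping the original categorical columns alongside the indicators makes it easy to filter on readable values while still feeding numeric columns to a regression.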

#### Data Scraping
1. KSL - main car info, favorites, views
2. Expected price from CarGurus and Zip Codes

#### KSL

#### Expected Price from CarGurus and Zip Codes

In [3]:
# import requests
# import pandas as pd
# import os
# import numpy as np
# import bs4
# import time

In [4]:
# # Read in scraped car data
# data_file = os.path.join(os.getcwd(),"data","all_cars.csv") 
# cars = pd.read_csv(data_file)
# pd.set_option("display.max_columns",None) 
# cars['mileage'] = cars['mileage'].astype('Int64')
# cars['year'] = cars['year'].astype('Int64')
# cars.info()

In [5]:
# # Create dictionary of zip codes for all towns to look up expected price
# # Website to look up zipcodes
# # http://localistica.com/usa/ut/salt%20lake%20city/zipcodes/all-zipcodes/

# def get_most_populated_zip_code(city):
#     try:
#         r = requests.get(f'http://localistica.com/search.aspx?q={city.lower().replace(" ", "+")}')
#         url = bs4.BeautifulSoup(r.text, "html.parser").find("a", id="ctl09_hlZipCodesCount")['href']
#         return int(bs4.BeautifulSoup(requests.get(url).text, "html.parser").find(id="dgZipCodes").find_all("tr")[1].td.a.text)
#     except Exception:  # network failure or unexpected page layout
#         return None
    

# # get unique cities in dataframe
# all_cities = [city for city in cars.city.unique() if isinstance(city, str)]
# keyList = [x + ", " + cars[cars.city == x].iloc[0]['state'] for x in all_cities]
# # look up zipcode
# zip_codes = {key: get_most_populated_zip_code(key + " " + cars[cars.city == key].iloc[0]['state']) for key in all_cities}
# # hard code the ones it missed
# zip_codes.update({'St. Anthony': 83445, 'Provo Canyon': 84604})
# # check for missing zip codes
# print(len([k for k,v in zip_codes.items() if v is None]))
# zip_codes

In [7]:
# # list of all cars in CarGurus database
# all_cars = requests.get("https://www.cargurus.com/Cars/getCarPickerReferenceDataAJAX.action?showInactive=false&useInventoryService=false&quotableCarsOnly=false&localCountryCarsOnly=true&outputFormat=REACT").json()

# # gets CarGuru make and model id to find price for individual make and model
# def get_cargurus_maker_and_model_ids(all_cars, car_make, car_model):
#     try:
#         all_models = [x for x in all_cars.get('allMakerModels').get('makers') if x.get('name') == car_make][0]
#         return (all_models.get('id'), [x for x in all_models.get('models') if x.get('name') == car_model][0].get('id'))
#     except (IndexError, AttributeError):
#         return (None, None)
    
# # gets the entity id which includes the make, model, and year of car
# def get_entity_id(maker_id, model_id, car_year):
#     try:
#         all_entities = requests.get(f"https://www.cargurus.com/Cars/getSelectedMakerModelCarsAJAX.action?showInactive=false&useInventoryService=false&quotableCarsOnly=false&localCountryCarsOnly=true&outputFormat=REACT&maker={maker_id}").json()
#         model_entity_ids = [car for car in all_entities.get('models') if car.get('id') == model_id][0]
#         return [ids for ids in model_entity_ids.get('cars') if ids.get('year') == car_year][0].get('id')
#     except (IndexError, AttributeError):
#         return None
    
# # gets the estimated listing price of the car based on entity id and the mileage
# def get_price(car_make, car_model, car_year, car_mileage, car_zip_code, all_cars):
#     maker_id, model_id = get_cargurus_maker_and_model_ids(all_cars, car_make, car_model)
#     if not model_id or pd.isna(car_mileage):
#         return None
    
#     entity_id = get_entity_id(maker_id, model_id, car_year)
    
#     if not entity_id:
#         return None 
#     # data needed to request CarGurus report
#     data = {
#         'carDescription.radius': 75,
#         'selectedEntity': entity_id,
#         'carDescription.transmission': "",
#         'carDescription.mileage': car_mileage,
#         'carDescription.postalCode': car_zip_code,
#         'carDescription.engineId': "",
#         'carDescription.vin': "",
#         'carDescription.vinType': "",
#         'forPrivateListing': True,
#         'inventoryListingId' : ""
#     }
    
#     res = requests.post("https://www.cargurus.com/Cars/generateReportJsonAjax.action", data=data)
#     res.raise_for_status()
#     try:
#         return res.json().get("priceDetails").get("privateListingPrice") #private listing price from CarGurus report
#     except AttributeError:
#         raise Exception(res.json())


# for index, row in cars.iterrows():
#     if row["city"] in zip_codes.keys():
#         try:
#             expected_price = get_price(row["make"], row["model"], row["year"], row["mileage"], zip_codes.get(row["city"]), all_cars)
#         except Exception as e:
#             print(e)
#             time.sleep(5)
#             continue
#         # change expected prices that are 0 to none
#         if expected_price is None or expected_price < 1:
#             expected_price = None 
#             print(index)
#         cars.loc[index, "expected_price"] = expected_price
#     else:
#         cars.loc[index, "expected_price"] = None

# # Add zip codes to a column
# for index, row in cars.iterrows():
#     cars.loc[index, "zip_code"] = zip_codes.get(row["city"])
    
# # save cars dataframe to pickle
# cars.to_pickle("./cars.pkl")

# # reload the pickled dataframe to verify the save
# pickle_cars = pd.read_pickle("./cars.pkl")
# pickle_cars.info() # 36,325 expected prices

### Exploratory Analysis

We will visualize our data in multiple ways to check our data scraping procedures and make sure we did not incorrectly classify our data. The first basic check we will do is scrolling through the data frame for any obvious errors using the display command. We will then use the describe command to look at the descriptive statistics of each column in our dataframe. Next we will visualize the data using a scatterplot matrix in order to check the histograms of each parameter for outliers and general trends. We can also use the scatterplot matrix to explore correlations between different parameters. We will also visualize a heat map of the correlation matrix to determine which parameters are strongly correlated. This information will be used to identify potential strong predictors for the multiple linear regression and determine if any parameters are potential confounders.
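The numerical part of these checks can be sketched as follows, using a tiny made-up dataframe (the column names and values are assumptions). The scatterplot matrix and heat map are omitted here so the example runs without a plotting library.

```python
import pandas as pd

# Made-up numeric columns; the real dataframe has many more attributes.
cars = pd.DataFrame({
    "year": [2008, 2011, 2014, 2017, 2020],
    "mileage": [150000, 120000, 90000, 60000, 30000],
    "price": [4200, 6800, 10500, 12500, 16000],
})

# Descriptive statistics for each column (count, mean, std, quartiles, ...).
summary = cars.describe()

# Pairwise Pearson correlations between the numeric columns.
corr = cars.corr()

# Rank the candidate predictors by absolute correlation with price.
price_corr = corr["price"].drop("price").abs().sort_values(ascending=False)

print(summary)
print(price_corr)
```

Columns that rank highly in `price_corr` are candidates for the initial regression, while strong correlations between two predictors hint at possible confounding.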

### Analysis Methodology

#### Regression
We will use regression to see if we can predict the price of a newly-listed used car. Our dependent variable will be list price, and the possible independent variables we will analyze include: year, seller type (dealer, private), mileage, color, title (clean/salvaged), transmission type, cylinders, fuel type, number of doors, exterior/interior condition, listing date, page views per day (when a listing has reached 7 days). We will use the Python package [statsmodels](http://www.statsmodels.org/stable/index.html) to perform all regression analyses. We will first fit a multiple linear regression using the parameters that had strong correlations with list price. Based on this initial model, we will adjust our multiple linear regression to include only parameters whose individual coefficients have significant p-values. We will use a significance level of α = 0.05. Our final model should have a p-value < 0.05 for the F-statistic of the overall model. We aim to explain at least 70% of the variance in list price, i.e., an R-squared value of 0.70 or more.

#### Clustering
We plan to group the listings we classify as a "good deal" by geographical location and create clusters showing areas in Utah where cars are generally sold for a good deal. We will create a heat map that displays the average percent difference between the CarGurus expected price and the list price to find geographical locations of good deals.
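The quantity behind the heat map can be sketched as below. The data are made up, and the sign convention (negative means listed below the expected price) is an assumption of this example.

```python
import pandas as pd

# Made-up listings with a list price and a CarGurus-style expected price.
cars = pd.DataFrame({
    "zip_code": [84101, 84101, 84604, 84604, 84321],
    "price": [9000, 11000, 7000, 8000, 12000],
    "expected_price": [10000, 10000, 8000, 8000, 10000],
})

# Percent difference relative to the expected price.
# Negative values mean the car is listed below its expected price.
cars["pct_diff"] = 100 * (cars["price"] - cars["expected_price"]) / cars["expected_price"]

# Average percent difference per ZIP code, best deals first.
avg_by_zip = cars.groupby("zip_code")["pct_diff"].mean().sort_values()

print(avg_by_zip)
```

Each ZIP code's average could then be mapped to a color on the heat map, with the most negative values marking the areas with the best deals.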

### Project Schedule

#### February 24th - 28th
* Check data accessibility (robots.txt and terms and conditions)
* Basic info due Wed Feb 26th
* Project Proposal due Fri Feb 28th
#### March 2nd - 6th
* Download html files for all recent listings from ksl
* Begin data scraping and create one dataframe with each row as a listing
* Get/give peer feedback March 5th
* Written feedback from staff by March 8th
#### March 9th - 13th (Spring Break)
* Finish data scraping 
* Exploratory analysis
* Describe 

#### March 16th - 20th
* Exploratory analysis
    * Scatter Matrix
        * Interpret histograms - check if there are any outliers that could be an error from scraping
        * Interpret correlations
    * Heatmap of Correlation Matrix
        * Interpret Correlations

#### March 23rd - 27th
* Write up project milestone
* Project milestone due March 29th 
* Milestone deliverables: acquired and cleaned data, EDA, sketches of analysis methods; submit a zip file with the Jupyter Notebook, data, and other resources.

#### March 30th - April 3rd 
* Get staff feedback
* Begin testbed for good deal predictions based on relation to scraped historical dataset

#### April 6th - April 10th
* Finalize predictive model for new listings
* Script and film project video

#### April 13th - 17th
* Polish up repository in preparation for final submission
* Edit and finalize project video
* Project Due Sunday April 19th
* Project awards April 21st



### Peer Feedback
Our Reviewers: Kyle Cornwall, Shushanna Mkrtychyan

* This is pretty similar to cargurus.com and KBB. How is this different than those existing sites?

* How do you know if a car has been in an accident?

* Look for granularity of NADAguides and devise ways that we can "beat" that model.

* Consider doing feature transformation when doing regression.

* Can you enhance the dataset with some other website?

* What features do other car valuation websites use to generate their price predictions?

* Can you get Carfax info from VIN? (without breaking the bank)

* Three potential classes when predicting a value (good, average, bad)

* Might need to downselect the number of cars we can predict prices for since our dataset size could be limited (i.e. top 20 most frequent cars)
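
The last suggestion, restricting the dataset to the most frequent cars, could look roughly like the sketch below. The helper name `keep_top_models`, the column names, and the data are assumptions for illustration.

```python
import pandas as pd


def keep_top_models(df: pd.DataFrame, n: int) -> pd.DataFrame:
    """Keep only rows whose make/model pair is among the n most frequent."""
    label = df["make"] + " " + df["model"]
    top_labels = label.value_counts().head(n).index
    return df[label.isin(top_labels)]


# Made-up listings; the real dataframe would hold the scraped cars.
cars = pd.DataFrame({
    "make": ["Honda", "Honda", "Toyota", "Ford", "Honda", "Toyota"],
    "model": ["Civic", "Civic", "Camry", "F-150", "Accord", "Camry"],
})

top_cars = keep_top_models(cars, n=2)
print(top_cars)
```

With the full dataset, `n=20` would match the "top 20 most frequent cars" idea from the feedback.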

### Video

Add link to final video