# Milestone

### Basic Information

Title: KSL AutomoDeals
<br>
Names: Chantel Charlebois, Taylor Hansen, Michael Paskett
<br>
E-mails: chantel.charlebois@utah.edu, taylor.c.hansen@utah.edu, michael.paskett@utah.edu
<br>
UIDS: u1043299, u0642850, u1144000

### Background and Motivation

Almost every person in Utah (and some neighboring states) buying a used car will visit KSL Cars classifieds to look for their new wheels. There are a few resources for understanding the rough value of a used car, such as Kelley Blue Book (kbb.com), but such services cannot fully capture the complex auto market of a local area. By storing and analyzing the prices, details, and options for a certain model or class of vehicle, a prospective buyer can evaluate how good the listed price for a vehicle actually is. With such a model, the user can estimate how much a specific car is really worth and determine if the vehicle is worthy of a test drive.

### Project Objectives

Questions:
* How well can we predict the price of a newly listed car based on the attributes available in an advertisement?
* Which attributes are most influential in determining the vehicle price?
* Which areas have the best car prices?

Aims:
* Create a regression model that will predict the expected price of a car based on several attributes, such as:
    * year,  seller type (dealer, private), mileage, color, title (clean/salvaged), transmission type, cylinders, fuel type, number of doors, exterior/interior condition, listing date, page views per day (when a listing has reached 7 days)
* Create a clustered map of “good deals” in different regions

Benefits:
* This project could benefit anyone in the market for a used car, helping them to be informed of the potential value of a car they are interested in.

### Data

We scraped our data from queries of used cars listed on [cars.ksl.com](https://cars.ksl.com/). We have read the robots.txt files for both ksl.com and cars.ksl.com to confirm that there are no restrictions on crawling their website. We have also reviewed the terms and conditions and similarly found no rules regarding crawling. There used to be an undocumented API for interfacing with KSL (as of four years ago), but it is no longer publicly accessible, so we scrape the pages ourselves with BeautifulSoup.
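A robots.txt check like the one described above can also be done programmatically. The following is a minimal sketch using Python's standard `urllib.robotparser`; the robots.txt text and the URLs are placeholders made up for illustration and are not copies of what ksl.com or cars.ksl.com actually serve.

```python
from urllib.robotparser import RobotFileParser

# Placeholder robots.txt content for illustration only; this is NOT the
# actual file served by ksl.com or cars.ksl.com.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URLs, used only to exercise the two rules above.
search_ok = is_allowed(EXAMPLE_ROBOTS_TXT, "https://cars.ksl.com/search")
private_ok = is_allowed(EXAMPLE_ROBOTS_TXT, "https://cars.ksl.com/private/page")
print(search_ok, private_ok)
```

In practice the text would come from the site's real robots.txt, and the terms and conditions would still need to be read by hand.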

To avoid consistently using too much bandwidth on their website, we started with “historical” data collection by saving .html pages over the course of the project so that we could parse them offline. However, because much of the information we scrape is dynamically loaded by JavaScript, the html files were bloated with other associated files, which would be too unwieldy to download on a daily basis (on the order of 5,000 cars/day). Instead, we opted for a live crawler approach to get our data.

We used the CarGurus website https://www.cargurus.com/Cars/sell-car/?pid=SellMyCarDesktopHeader and their API to request an expected car price based on each car's make, model, mileage, year, and ZIP code.

### Ethical Considerations

Stakeholders:
* The creators (us)
* The seller
* The prospective buyer
* KSL

Our incentive as creators and prospective buyers is to find good deals without spending hours manually searching through KSL. For other prospective buyers, the same applies. The sellers have competing interests, as they would like to sell their car for as much as possible. KSL also has a stake in this project, as it makes revenue from ads and from sellers paying for better listings in order to make their vehicle more prominent.
We anticipate that other ethical considerations may arise as the project progresses and details are worked out.

### Data Processing

Each listing page has a fairly consistent format, making scraping feasible for the large number of pages we will be analyzing. The quantities we plan to derive from our data have been listed above in the Project Objectives section. When creating a listing, the user is required to list the year, VIN, make, model, body style, mileage, title type, asking price, and ZIP code. A timestamp is also associated with each listing. Together, these are the only features we can guarantee to extract from each page. Of course, many listings include more details than these, which we can and plan to use.

As mentioned above, data will be scraped with BeautifulSoup and will be structured into a pandas dataframe. Dummy variables for categorical variables may be generated and concatenated to this dataframe to facilitate use of these variables. Subsequent processing will be done using built-in pandas masking to get relevant rows from the dataframe for new queries when searching recently listed used cars.
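The dummy-variable and masking steps could look roughly like the sketch below. The toy listings, the column names, and the query are assumptions made for illustration; they do not reflect the actual columns of our scraped dataframe.

```python
import pandas as pd

# Toy listings; column names and values are illustrative assumptions.
listings = pd.DataFrame({
    "make": ["Honda", "Toyota", "Honda"],
    "year": [2012, 2015, 2018],
    "price": [7500, 12000, 17900],
    "title": ["clean", "salvaged", "clean"],
})

# One indicator column per category, concatenated onto the original frame.
title_dummies = pd.get_dummies(listings["title"], prefix="title")
listings = pd.concat([listings, title_dummies], axis=1)

# Boolean mask selecting the rows relevant to a new query.
mask = (listings["make"] == "Honda") & (listings["year"] >= 2015)
matches = listings[mask]
print(matches)
```

Keeping the original `title` column alongside the indicator columns makes it easy to filter on the readable value while feeding the indicators to a model.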

### Exploratory Analysis

We will visualize our data in multiple ways to check our data scraping procedures and make sure we did not incorrectly classify our data. The first basic check we will do is scrolling through the data frame for any obvious errors using the display command. We will then use the describe command to look at the descriptive statistics of each column in our dataframe. Next we will visualize the data using a scatterplot matrix in order to check the histograms of each parameter for outliers and general trends. We can also use the scatterplot matrix to explore correlations between different parameters. We will also visualize a heat map of the correlation matrix to determine which parameters are strongly correlated. This information will be used to identify potential strong predictors for the multiple linear regression and determine if any parameters are potential confounders.
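As a rough sketch of the numeric part of these checks, the snippet below runs `describe` and a correlation matrix on a small made-up dataframe and ranks the candidate predictors by their correlation with price. The column names and values are invented for illustration only.

```python
import pandas as pd

# Made-up values for illustration; not taken from our scraped data.
df = pd.DataFrame({
    "price":   [17900, 12000, 7500, 5200, 3100],
    "year":    [2018, 2015, 2012, 2009, 2005],
    "mileage": [30000, 70000, 110000, 150000, 200000],
})

summary = df.describe()  # count, mean, std, min, quartiles, max per column
corr = df.corr()         # pairwise Pearson correlation matrix

# Rank the candidate predictors by absolute correlation with price.
ranked = corr["price"].drop("price").abs().sort_values(ascending=False)
print(summary)
print(ranked)
```

The plots themselves are omitted here; `pandas.plotting.scatter_matrix` is one option for the scatterplot matrix, and the `corr` table is what a correlation heat map would display.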

### Analysis Methodology

#### Regression
We will use regression to see if we can predict the price of a newly listed used car. Our dependent variable will be list price, and the possible independent variables we will analyze include: year, seller type (dealer, private), mileage, color, title (clean/salvaged), transmission type, cylinders, fuel type, number of doors, exterior/interior condition, listing date, and page views per day (when a listing has reached 7 days). We will use the Python package [statsmodels](http://www.statsmodels.org/stable/index.html) to perform all regression analyses. We will first fit a multiple linear regression using the parameters that had strong correlations with list price. Based on this initial model, we will adjust our multiple linear regression to include only parameters that have significant p-values for their individual coefficients. We will use a significance level of 𝝰=0.05. Our final model should have a p-value < 0.05 for the F-statistic of the overall model. We aim to explain at least 70% of the variance with our model, i.e., an R-squared value of 0.70 or more.
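To make the R-squared target concrete, here is a minimal sketch of a multiple linear regression on synthetic data. Note that it uses NumPy's least-squares solver rather than statsmodels, so it only yields coefficients and R-squared, not the p-values or F-statistic described above. The predictors, coefficients, and noise level are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic predictors (assumed ranges, for illustration only).
year = rng.integers(2000, 2020, size=n).astype(float)
mileage = rng.uniform(10_000, 200_000, size=n)

# Synthetic prices: newer cars cost more, higher mileage costs less, plus noise.
price = 9000.0 + 800.0 * (year - 2000) - 0.04 * mileage + rng.normal(0, 1500, size=n)

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n), year - 2000, mileage])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)

# R-squared = 1 - (residual sum of squares / total sum of squares).
pred = X @ coef
ss_res = np.sum((price - pred) ** 2)
ss_tot = np.sum((price - price.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot
print(coef, r_squared)
```

With real listings, the same design matrix could be passed to statsmodels to obtain the per-coefficient p-values used for pruning predictors.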

#### Clustering
We plan to cluster what we classify as a “good deal” by geographical location and create clusters showing areas in Utah where cars are generally sold for a good deal. We're going to create a heat map that displays the average percent difference between the CarGurus expected price and the list price to find geographical locations of good deals.
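One way to compute the quantity behind such a heat map is sketched below. The sign convention (positive means listed below the expected price), the normalization by expected price, the grouping by ZIP code, and all column names and values are assumptions made for this example.

```python
import pandas as pd

# Made-up listings; column names are illustrative assumptions.
cars = pd.DataFrame({
    "zip_code": ["84101", "84101", "84601", "84601"],
    "list_price": [9000.0, 11000.0, 8000.0, 10000.0],
    "expected_price": [10000.0, 10000.0, 10000.0, 10000.0],
})

# Percent difference relative to the expected price.
# Positive values mean the car is listed below its expected price.
cars["pct_below_expected"] = (
    (cars["expected_price"] - cars["list_price"]) / cars["expected_price"] * 100.0
)

# Average percent difference per ZIP code: the value a heat map would color by.
avg_by_zip = cars.groupby("zip_code")["pct_below_expected"].mean()
print(avg_by_zip)
```

Areas with a higher average would then show up as regions where cars tend to be listed below their expected price.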

### Project Schedule

#### February 24th - 28th
* Check data accessibility (robots.txt and terms and conditions)
* Basic info due Wed Feb 26th
* Project Proposal due Fri Feb 28th
#### March 2nd - 6th
* Download html files for all recent listings from ksl
* Begin data scraping and create one dataframe with each row as a listing
* Get/give peer feedback March 5th
* Written feedback from staff by March 8th
#### March 9th - 13th (Spring Break)
* Finish data scraping 
* Exploratory analysis
* Describe 

#### March 16th - 20th
* Exploratory analysis
    * Scatter Matrix
        * Interpret histograms - check if there are any outliers that could be an error from scraping
        * Interpret correlations
    * Heatmap of Correlation Matrix
        * Interpret Correlations

#### March 23rd - 27th
* Write up project milestone
* Project milestone due March 29th 
* Acquired, cleaned data, EDA, Sketches of your analysis methods, Submit zip file with Jupyter Notebook, data, other resources.

#### March 30th - April 3rd 
* Get staff feedback
* Begin testbed for good deal predictions based on relation to scraped historical dataset

#### April 6th - April 10th
* Finalize predictive model for new listings
* Script and film project video

#### April 13th - 17th
* Polish up repository in preparation for final submission
* Edit and finalize project video
* Project Due Sunday April 19th
* Project awards April 21st

### Peer Feedback
Our Reviewers: Kyle Cornwall, Shushanna Mkrtychyan

* This is pretty similar to cargurus.com and KBB. How is this different than those existing sites?

* How do you know if a car has been in an accident?

* Look for granularity of NADAguides and devise ways that we can "beat" that model.

* Consider doing feature transformation when doing regression.

* Can you enhance the dataset with some other website?

* What features do other car valuation websites use to generate their price predictions?

* Can you get Carfax info from VIN? (without breaking the bank)

* Three potential classes when predicting a value (good, average, bad)

* Might need to downselect the number of cars we can predict prices for since our dataset size could be limited (i.e. top 20 most frequent cars)
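
The downselection suggested in the last point could be sketched as follows: count listings per make/model pair and keep only the most frequent ones. The helper name `keep_top_models`, the column names, and the toy data are hypothetical.

```python
import pandas as pd


def keep_top_models(df: pd.DataFrame, n: int) -> pd.DataFrame:
    """Keep only rows whose (make, model) pair is among the n most frequent."""
    counts = df.groupby(["make", "model"]).size().nlargest(n)
    top_pairs = counts.reset_index()[["make", "model"]]
    # An inner merge drops every listing whose pair is not in the top n.
    return df.merge(top_pairs, on=["make", "model"], how="inner")


# Toy data: three Civics, two Camrys, one Focus.
cars = pd.DataFrame({
    "make": ["Honda", "Honda", "Honda", "Toyota", "Toyota", "Ford"],
    "model": ["Civic", "Civic", "Civic", "Camry", "Camry", "Focus"],
    "price": [7500, 8200, 9100, 12000, 11500, 6400],
})
top_two = keep_top_models(cars, n=2)
print(top_two)
```

Calling it with `n=20` on the full dataset would correspond to the "top 20 most frequent cars" idea.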

# Description of Folder Contents

### Main Directory (automodeals)
`AddNewerCarsToRepository.ipynb`:
* This is a (now deprecated) notebook which adds newly listed used cars to our `all_cars.csv` data repository. This functionality is now handled by `updateWrapper.py`

`Data_Exploration_Analysis.ipynb`:
* This notebook explores the data, cleans it up, and does our preliminary analysis.

Data Exploration:
1. Scroll through the dataframe for any obvious errors
2. Use the describe command to verify the descriptive statistics
3. Visualize using the scatterplot matrix to explore histograms of each parameter and look for outliers/trends
4. Plot a heat map of the correlation matrix to determine which parameters are strongly correlated, in order to identify potential strong predictors for the multiple linear regression or potential confounders

Data Analysis:

Regression:
* Aim: predict the price of a newly listed used car
* Dependent variable: list price
* Possible independent variables: year, seller type (dealer, private), mileage, color, title (clean/salvaged), transmission type, cylinders, fuel type, number of doors, exterior/interior condition, listing date, page views per day (when a listing has reached 7 days).
* We will use the Python package statsmodels to perform all regression analyses.

Methods:
* Multiple linear regression first using the parameters that had strong correlations with list price.
* Based on this initial model, we will adjust our multiple linear regression to include only parameters that have significant p-values for their individual coefficients.
* Multiple linear regression for the most popular makes/models of cars, so that we are building a regression model for a particular make/model rather than across all cars.

Significance level: 𝝰=0.05

Expected Outcomes:
* Our final model should have a p-value < 0.05 for the F-statistic of the overall model.
* We are aiming to explain at least 70% of the variance with our model and hope to get an R-squared value of 0.70 or more.

Clustering:
* Aim: Cluster what we classify as a “good deal” by geographical location and create clusters showing areas in Utah where cars are generally sold for a good deal. We will use the CarGurus expected price as the measure of a fair price.


`Favorites_Views_Updater.ipynb`:
* This notebook searches through cars that have been found previously and updates them with the number of page views and favorites.  

`GetExpectedPrice.ipynb`:
* This notebook uses the CarGurus site to get an expected price for each car based on the make, model, year, mileage, and ZIP code. So far, we have been able to get expected prices for ~25,000 cars.

`Main.ipynb`:
* This file will be used to display clean code for the final submission, but it is empty for now.

`Milestone.ipynb`:
* This current file.

`Peer_Review_Notes.ipynb`:
* Notes from the peer review (already integrated in the section above).

`ProtoCrawlerScraper.ipynb`:
* This is the initial testbed used to try out code and techniques for the data scraping and updating portions of our project. It has since been split into several other files including:
    * `AddNewerCarsToRepository.ipynb`
    * `RestartRepository.ipynb`
    * Most of the functions in the automodeals/py directory

`RestartRepository.ipynb`:
* This is a (now deprecated) notebook which finds all used cars on KSL and adds them to a data repository called `all_cars.csv`. This functionality is now handled by `RestartRepository.py`

### automodeals/data
##### /archive
* As the name implies, the archive directory holds old data that is no longer part of the main analysis but was useful for early prototyping.

##### /daily
* The automodeals/data/daily directory is a backup directory for the daily data scraped from KSL. It also serves as a good way to gauge how much the repository is growing daily. Depending on how fast the repository grows, this functionality may be removed in the future.

`all_cars.csv`:
* This is the main data file and is updated daily by `updateWrapper.py` as an automated scheduled task. As of writing, we've amassed over 35,000 cars' worth of data for our repository with ~5,000 being added each day.

`all_cars_view_fav.csv`:
* Created by the `Favorites_Views_Updater`. Contains listing views and favorite numbers.  

### automodeals/errors
`git_error_log.csv`:
* This is an error log that is updated when there are errors running `uploadRepository.py` when called from `updateWrapper.py`

`liters_error_log.csv`:
* This is an error log populated with strings from when the `liters` column of `all_cars.csv` was not parsed with the existing regular expression. The idea is that this info can be used to inform a more robust regular expression for future scraping.
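
The parse-or-log pattern described here could look something like the sketch below. The regular expression is a hypothetical stand-in, not the one used in our scraper, and an in-memory buffer replaces `liters_error_log.csv` so the example is self-contained.

```python
import csv
import io
import re

# Hypothetical pattern: a number followed by "L" or "liter(s)",
# e.g. "2.5L" or "3.6 Liter". This is NOT the scraper's actual regex.
LITERS_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:l|liters?)\b", re.IGNORECASE)


def parse_liters(text, error_writer):
    """Return engine displacement in liters, or log the raw string and return None."""
    match = LITERS_RE.search(text)
    if match:
        return float(match.group(1))
    error_writer.writerow([text])  # keep the unparsed string for later inspection
    return None


# In-memory stand-in for the error log file.
log = io.StringIO()
writer = csv.writer(log)
values = [parse_liters(s, writer) for s in ["2.5L", "3.6 Liter V6", "V8 Hemi"]]
print(values)
print(log.getvalue())
```

Reviewing the logged strings periodically shows which formats the pattern misses and how it should be extended.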

### automodeals/html
This is a directory in which we originally planned to store all of the html files we downloaded from KSL. However, because much of the information we scrape is dynamically loaded by JavaScript, the html files were bloated with other associated files which would be too unwieldy to download on a daily basis (on the order of 5,000 cars/day). Instead, we opted for a live crawler approach to get our data.

### automodeals/py
`AddNewerCarsToRepository.py`:
* As the name implies, this function crawls KSL and finds any newly listed cars since the last time it was run. It is called by `updateWrapper.py`.

`carscraper.py`:
* Called by `AddNewerCarsToRepository.py`. This function does the heavy lifting of crawling KSL search pages and subsequent listings and returning a pandas dataframe to update `all_cars.csv`.

`carscraper_verbose.py`:
* This is a verbose version of the above function with many more print and try-except statements for debugging purposes.

`determinemaxpg.py`:
* This function is called by `RestartRepository.py` to determine how many pages of used car search results there are to crawl through.

`Favorites_views_updater.py`:
* This function pulls page views and favorites, similar to `Favorites_Views_Updater.ipynb` in the main directory. This will soon be run daily.

`generateProxies.py`:
* As the name implies, this function uses selenium to generate a proxy list with which to continue to visit KSL (in the case of an IP block). It is only called by `carscraper.py` if an IP block is detected (via a 403 error).

`merge_all_with_fav.py`:
* This function merges the original scraped listing data with the data including page views and favorites.

`removeduplicates.py`:
* This function navigates through the `all_cars` dataframe and removes duplicate rows (based on the `link` column) prior to saving to `all_cars.csv`. Duplicates can sometimes appear because as the crawler moves from one search page to the next, a few cars from the previous page may have trickled onto the new page as newer cars are listed.
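
A minimal sketch of link-based deduplication with pandas is shown below. The toy rows and link values are made up, and keeping the first occurrence of each link is an assumption about the desired behavior.

```python
import pandas as pd

# Toy data: listing 2 appears twice, as if it trickled onto the next search page.
all_cars = pd.DataFrame({
    "link": ["/listing/1", "/listing/2", "/listing/2", "/listing/3"],
    "price": [7500, 12000, 12000, 17900],
})

# Keep the first row for each link and renumber the index.
deduped = all_cars.drop_duplicates(subset="link", keep="first").reset_index(drop=True)
print(deduped)
```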

`RestartRepository.py`:
* This function can be called to regenerate the `all_cars.csv` data repository from scratch.

`updateWrapper.py`:
* This is the main data update script to add newly listed cars to the `all_cars.csv` data repository. It calls `AddNewerCarsToRepository.py` to scrape KSL for newly listed cars and then pushes the updated repository to GitHub via `uploadRepository.py`.
* This script is handled fully automatically now via a task scheduler on a remote machine. Approximately 5,000 cars are added to our repository on a daily basis, thanks to this script.

`uploadRepository.py`:
* Called by `updateWrapper.py`. This function automatically adds, commits, and pushes the updated `all_cars.csv` to our GitHub repository.
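
The add/commit/push sequence could be sketched with `subprocess` as below. The function names, the commit message, and the injectable `runner` parameter are hypothetical choices for this example and do not describe the actual contents of `uploadRepository.py`.

```python
import subprocess


def git_commands(csv_path, message):
    """Build the three git invocations needed to publish an updated file."""
    return [
        ["git", "add", csv_path],
        ["git", "commit", "-m", message],
        ["git", "push"],
    ]


def upload(csv_path, message, runner=subprocess.run):
    """Run add, commit, and push in order; check=True raises on the first failure."""
    for cmd in git_commands(csv_path, message):
        runner(cmd, check=True)


# Dry run with a stand-in runner that records commands instead of calling git.
recorded = []
upload("data/all_cars.csv", "Daily update", runner=lambda cmd, check: recorded.append(cmd))
print(recorded)
```

With the default runner, a failing git command raises `subprocess.CalledProcessError`, which a caller could catch and write to an error log.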

### automodeals/screenshots
This directory houses any screenshots of important progress points that don't otherwise exist in our codebase (e.g. setting up `updateWrapper.py` as a scheduled task) that we may wish to include in our final submission.


### automodeals/selenium
`proxyIPselenium.ipynb`:
* This file lists important instructions for getting selenium set up on a user's machine so that a set of proxy IP addresses can be generated for scraping, if needed.

`chromedriver.exe`:
* This is the executable necessary to run an automated version of Chrome via selenium.