# CarGurus.com Scrape
- Notes: scraped September 2020. Good data: includes description, major options, etc.
- 3M observations, 66 variables
- https://www.kaggle.com/datasets/ananaymital/us-used-cars-dataset
- original filename used_cars_data.csv

### Other dataset-specific notebooks to review
- https://www.kaggle.com/code/alvarolozanoalonso/mileage-tfm-15-09-2021
- https://www.kaggle.com/code/fr3shk/vehicle-price-prediction-part-1
- https://www.kaggle.com/code/fr3shk/vehicle-price-prediction-part-2

## Environment

In [None]:
# environment details

from platform import python_version
print('python',python_version())

from importlib.metadata import version
print('numpy',version('numpy'))
print('pandas',version('pandas'))
print('pandas_profiling',version('pandas_profiling'))

In [None]:
import random
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
from uszipcode import SearchEngine # https://pypi.org/project/uszipcode/ / https://www.pythonpool.com/uszipcode-python/
import missingno as msno # https://github.com/ResidentMario/missingno

## Data

### Load and Inspect

In [None]:
# load data
filename = 'data/cargurus.csv'

# total number of records in csv (line count minus the header row)
# 3,000,598
with open(filename) as f:
    num_records = sum(1 for line in f) - 1

# number of records to load
sample_size = 100000

# randomization: pick the line numbers to skip (line 0, the header, is always kept)
skip = sorted(random.sample(range(1,num_records+1),num_records-sample_size))

# data
df = pd.read_csv(filename
                    ,skiprows = skip
                    ,low_memory = False
                    ,dtype={'dealer_zip': str
                            })
print(df.shape)
df.info(verbose=False)
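
The sampling step above relies on `skiprows` taking line numbers, with line 0 (the header) never skipped. A minimal sketch of the same idea on a small in-memory CSV, assuming one record per line (quoted fields that contain line breaks would throw off the line count). The toy data and the names `toy_csv` and `sample` are made up for illustration:

```python
import io
import random

import pandas as pd

# toy CSV: one header line followed by 1,000 records (illustrative data only)
toy_csv = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(1000))

num_records = toy_csv.count("\n")   # number of lines after the header
sample_size = 100

random.seed(0)  # fixed seed so the sample is reproducible
# line numbers to skip; line 0 is the header and is never in this list
skip = sorted(random.sample(range(1, num_records + 1), num_records - sample_size))

sample = pd.read_csv(io.StringIO(toy_csv), skiprows=skip)
print(sample.shape)
```

Seeding `random` is optional, but it makes the drawn sample repeatable between runs.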

#### Variable Descriptions
**0. vin:** Vehicle Identification Number (VIN), also called a chassis number or frame number: a unique code, including a serial number, used by the automotive industry to identify individual motor vehicles, towed vehicles, motorcycles, scooters and mopeds, as defined in ISO 3779 (content and structure) and ISO 4030 (location and attachment).  
**1. back_legroom:** Legroom in the rear seat measured in inches.  
**2. bed:** Category of bed size (open cargo area) in pickup truck. Null usually means the vehicle isn't a pickup truck.  
**3. bed_height:**  Height of bed in inches.  
**4. bed_length:** Length of bed in inches.  
**5. body_type:** Body Type of the vehicle. Like Convertible, Hatchback, Sedan, etc.  
**6. cabin:** Category of cabin size in pickup truck. Eg: Crew Cab, Extended Cab, etc.  
**7. city:** City where the car is listed. Eg: Houston, San Antonio, etc.   
**8. city_fuel_economy:** Fuel economy in city traffic in miles per gallon.  
**9. combine_fuel_economy:** Combined fuel economy is a weighted average of City and Highway fuel economy in miles per gallon.  
**10. daysonmarket:** Days since the vehicle was first listed on the website.  
**11. dealer_zip:** Zipcode of the dealer.  
**12. description:** Vehicle description on the vehicle's listing page.  
**13. engine_cylinders:** The engine configuration. Eg: I4, V6, etc.   
**14. engine_displacement:** Engine displacement is the measure of the cylinder volume swept by all of the pistons of a piston engine, excluding the combustion chambers.  
**15. engine_type:** The engine configuration. Eg: I4, V6, etc.  
**16. exterior_color:** Exterior color of the vehicle, usually a fancy one same as the brochure.  
**17. fleet:** Whether the vehicle was previously part of a fleet.  
**18. frame_damaged:** Whether the vehicle has a damaged frame.  
**19. franchise_dealer:** Whether the dealer is a franchise dealer.  
**20. franchise_make:** The company that owns the franchise.  
**21. front_legroom:** The legroom in inches for the passenger seat.  
**22. fuel_tank_volume:** Fuel tank's filling capacity in gallons.  
**23. fuel_type:** Dominant type of fuel ingested by the vehicle.  
**24. has_accidents:** Whether the vin has any accidents registered.  
**25. height:** Height of the vehicle in inches.  
**26. highway_fuel_economy:** Fuel economy in highway traffic in miles per gallon.  
**27. horsepower:** Horsepower is the power produced by an engine.  
**28. interior_color:** Interior color of the vehicle, usually a fancy one same as the brochure.  
**29. isCab:** Whether the vehicle was previously taxi/cab.  
**30. is_certified:** Whether the vehicle is certified. Certified cars are covered through warranty period.  
**31. is_cpo:** Pre-owned cars certified by the dealer. Certified vehicles come with a manufacturer warranty for free repairs for a certain time period.  
**32. is_new:** True if the vehicle was launched less than 2 years ago.  
**33. is_oemcpo:** Pre-owned cars certified by the manufacturer.  
**34. latitude:** Latitude from the geolocation of the dealership.  
**35. length:** Length of the vehicle in inches.  
**36. listed_date:** The date the vehicle was listed on the website. Does not make daysonmarket obsolete: the price is the one in effect daysonmarket days after the listed date.  
**37. listing_color:** Dominant color group from the exterior color.  
**38. listing_id:** Listing id from the website.  
**39. longitude:** Longitude from the geolocation of the dealership.  
**40. main_picture_url:** URL of the vehicle's picture.  
**41. major_options:** Optional packages of the vehicle.  
**42. make_name:** Vehicle's brand.  
**43. maximum_seating:** Total number of seats.  
**44. mileage:** Refers to the distance that the vehicle has travelled, measured in miles.  
**45. model_name:** Model name of the vehicle.  
**46. owner_count:** Number of owners the vehicle has had along its life.  
**47. power:** Maximum power of the vehicle and the rpm to develop the power.  
**48. price:** Sale price of the vehicle on the website.  
**49. salvage:** In North America, a salvage title is a form of vehicle title branding, which notes that the vehicle has been damaged and/or deemed a total loss by an insurance company that paid a claim on it.  
**50. savings_amount:** Undefined variable.  
**51. seller_rating:** The Seller Rating is created by data received from buyers in an effort to measure the quality of the experience you provide your customers.  
**52. sp_id:** Dealer id.  
**53. sp_name:** Dealer name.  
**54. theft_title:** Vehicle that was stolen and later recovered.  
**55. torque:** Torque indicates the force to which the drive shaft is subjected. Also the revolutions needed to reach the maximum torque.  
**56. transmission:** Type of transmission, such as Automatic, Manual, etc.  
**57. transmission_display:** Number of gears and type of transmission.   
**58. trimId:** Number of a particular version of a model with a particular set of configuration.  
**59. trim_name:** Name of a particular version of a model with a particular set of configuration.  
**60. vehicle_damage_category:** Category of a vehicle's damage, such as Category A meaning a ‘Scrap’ car.  
**61. wheel_system:** Traction system of a vehicle, such as AWD or FWD.  
**62. wheel_system_display:** Traction system of a vehicle, such as All Wheel Drive or Front Wheel Drive.  
**63. wheelbase:** The distance between the front and rear axles of a vehicle.  
**64. width:** The distance between both sides of a vehicle.  
**65. year:** The year the car was built.  

In [None]:
# drop columns unimportant to model
df.drop(columns=['listing_id'
                ,'main_picture_url'
                ,'savings_amount' # involves cargurus price projection
                ,'sp_id' # dealer id
                ,'sp_name' # dealer name
                ] 
    ,inplace=True)

In [None]:
# drop duplicates by vin
print(df.shape)
df.drop_duplicates(subset = 'vin',inplace=True)
print(df.shape)

In [None]:
# drop empty columns
df.dropna(axis = 'columns', how = 'all', inplace = True)
print(df.shape)

In [None]:
print(df.columns)

In [None]:
# set VIN as index
df.set_index("vin", inplace=True)

In [None]:
# keep used vehicles only (drop listings flagged as new)
print(df.shape)
df = df[df['is_new'] == False]
print(df.shape)

In [None]:
# rename fields
df.rename(columns={'city_fuel_economy':'mpg_city'
                    ,'highway_fuel_economy':'mpg_hwy'
                    ,'exterior_color':'color_exterior'
                    ,'interior_color':'color_interior'
                    ,'year':'model_year'
                    }, inplace=True)

In [None]:
# numeric
df.describe().round(decimals = 1).T

In [None]:
df['back_legroom'].str.split?

In [None]:
# parse mixed fields
df['back_legroom'].str.split()
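
`str.split()` on its own returns lists rather than numbers. A minimal sketch of one way to pull the numeric part out of a measurement column, assuming values look like `'38.3 in'` with `'--'` marking unknowns (an assumption about the raw format). The sample values are invented and `parse_inches` is a hypothetical helper:

```python
import numpy as np
import pandas as pd

def parse_inches(col: pd.Series) -> pd.Series:
    """Extract the leading number from strings such as '38.3 in'; anything else becomes NaN."""
    number = col.str.extract(r'^\s*(\d+(?:\.\d+)?)', expand=False)
    return pd.to_numeric(number, errors='coerce')

# invented sample values, including a placeholder and a true missing value
toy = pd.Series(['38.3 in', '35 in', '--', np.nan], name='back_legroom')
parsed = parse_inches(toy)
print(parsed)
```

The same helper could be applied to the other dimension fields if they share this format, which should be checked against the data first.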

In [None]:
# non-numeric
df.describe(include = ['object','category']).T

In [None]:
# three example obs
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', 500)

df.sample(3).T

### Missing

In [None]:
# missing values
nulls = pd.DataFrame(data = {'count':df.isna().sum()})
nulls['pct'] = nulls['count'] / df.shape[0]
nulls.sort_values(by = 'count', ascending = False)

In [None]:
# drop columns with > 20% missing values (thresh = minimum number of non-missing values to keep a column)
print(df.shape)
df.dropna(axis = 'columns', thresh = int(df.shape[0] * 0.8), inplace = True)
print(df.shape)
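
In `dropna`, `thresh` is the minimum number of non-missing values a column needs in order to be kept, so a threshold of 80% of the row count removes any column that is more than 20% missing. A small illustration on made-up data (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'full':       [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],                   # 0% missing
    'ten_pct':    [1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan],               # 10% missing
    'thirty_pct': [1, 2, 3, 4, 5, 6, 7, np.nan, np.nan, np.nan],     # 30% missing
})

# keep columns with at least 80% non-missing values
min_non_missing = int(len(toy) * 0.8)
kept = toy.dropna(axis='columns', thresh=min_non_missing)
print(list(kept.columns))
```

To drop only columns that are more than 80% missing instead, the threshold would be 20% of the row count.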

In [None]:
msno.heatmap(df)

In [None]:
# missingness correlations
# keep only columns that are partly missing, so the missing-value indicator varies
temp = df.loc[:, df.isna().nunique() > 1]
corr_mat = temp.isnull().corr()

corr_mat2 = corr_mat.unstack().reset_index()
corr_mat2.columns = ['var1','var2','corr']
corr_mat2['var_min'] = corr_mat2[['var1','var2']].min(axis=1)
corr_mat2['var_max'] = corr_mat2[['var1','var2']].max(axis=1)
corr_mat2.drop(columns=['var1','var2'], inplace=True)
corr_mat2.drop_duplicates(inplace=True)
corr_mat2 = corr_mat2[corr_mat2['var_min'] != corr_mat2['var_max']]
corr_mat2 = corr_mat2[corr_mat2['corr'] > 0.1]
corr_mat2.sort_values(by='corr', ascending=False)
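
The cell above correlates the missing-value indicators of each pair of columns and lists each pair once. A compact sketch of the same idea on made-up data, using upper-triangle indices instead of the min/max de-duplication (a swapped-in technique, not the notebook's own code; the values are invented):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'bed':        [np.nan, np.nan, 'Short', np.nan, 'Long', np.nan],
    'bed_length': [np.nan, np.nan, 67.1, np.nan, 78.9, np.nan],
    'mileage':    [10.0, np.nan, 30.0, 40.0, 50.0, 60.0],
    'price':      [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],   # never missing, so excluded below
})

missing = toy.isna()
missing = missing.loc[:, missing.nunique() > 1]   # columns that are only partly missing
corr = missing.astype(float).corr()

# each unordered pair once: positions strictly above the diagonal
i, j = np.triu_indices(len(corr.columns), k=1)
pairs = pd.DataFrame({
    'var1': corr.columns[i],
    'var2': corr.columns[j],
    'corr': corr.to_numpy()[i, j],
}).sort_values('corr', ascending=False, ignore_index=True)
print(pairs)
```

Columns that are always missing together, such as the two bed fields in this toy frame, show a correlation of 1.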

#### Mileage
Mileage is an essential feature and is expected to have a large impact on the model, so rows with missing mileage are dropped rather than imputed.

In [None]:
print(df.shape)
print(df['mileage'].isna().sum())
df = df[df['mileage'].notna()]
print(df.shape)

Mileage is mostly missing on new vehicles.

### Transformations

In [None]:
# update dtypes

### States

In [None]:
%%time

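# add state from the dealer zip code
search = SearchEngine()  # create the search engine once, not once per row

def zip_to_state(zip_code):
    # two-letter state for a zip code, or NaN when the lookup finds nothing
    result = search.by_zipcode(zip_code)
    return getattr(result, 'state', np.nan)

# look up each distinct zip code once, then map the result back to the rows
zip_states = {z: zip_to_state(z) for z in df['dealer_zip'].dropna().unique()}
df['state'] = df['dealer_zip'].map(zip_states)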

print('state not found:',df['state'].isna().sum())

In [None]:
print((df['state'].value_counts()).round(decimals = 2).head(10))

In [None]:
print((df['state'].value_counts())[['TX','OH','VA']].round(decimals = 2))
print((df['state'].value_counts()/df.shape[0])[['TX','OH','VA']].round(decimals = 2))
print((df['state'].value_counts()/df.shape[0])[['TX','OH','VA']].sum().round(decimals = 2))
# rough record count for these states in the full ~3M-row file, assuming an 18% combined share
print(f'{0.18*3000000:,.0f}')
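
The counts and shares above can also be computed in one pass with `value_counts(normalize=True)`. A small sketch on invented data (the state values and the resulting share are illustrative, not figures from the dataset):

```python
import pandas as pd

toy_state = pd.Series(['TX', 'TX', 'OH', 'VA', 'CA', 'CA', 'CA', 'NY', 'FL', 'TX'], name='state')

share = toy_state.value_counts(normalize=True)                 # fraction of rows per state
selected = share.reindex(['TX', 'OH', 'VA'], fill_value=0.0)   # tolerate states absent from the sample
combined = selected.sum()

print(selected.round(2))
print(f'combined share: {combined:.2f}')
```

Using `reindex` with `fill_value` avoids a `KeyError` when one of the selected states does not appear in a small sample.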