# Getting Started with Fortunato Wheels Data

This notebook walks through the basics of working with the Fortunato Wheels data. It will:

1. Load the Cargurus data
2. Connect to the database and get all ads (this requires the MongoDB connection string)
3. Perform basic preprocessing on the data
4. Create a simple plot with Plotly

In [24]:
import logging
import os
import sys

import pandas as pd
import plotly.express as px

# make sure we can import modules from the src directory
cur_dir = os.getcwd()
SRC_PATH = cur_dir[
    : cur_dir.index("fortunato-wheels-engine") + len("fortunato-wheels-engine")
]
if SRC_PATH not in sys.path:
    sys.path.append(SRC_PATH)

from src.data.car_ads import CarAds
from src.logs import get_logger

# Create a custom logger
logger = get_logger(__name__)

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
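
The path setup in the cell above slices the working directory string at the repository name. As a hedged alternative, the sketch below does the same job with `pathlib`. The helper names `find_repo_root` and `add_to_sys_path` are hypothetical and not part of the project, and the demo path is made up. Unlike the string slice, it matches the closest ancestor directory with exactly that name and raises a descriptive error when the notebook is run from outside the repository.

```python
import sys
from pathlib import Path


def find_repo_root(start: Path, repo_name: str) -> Path:
    """Return the closest ancestor of `start` (or `start` itself) named `repo_name`."""
    for candidate in (start, *start.parents):
        if candidate.name == repo_name:
            return candidate
    raise FileNotFoundError(f"No directory named {repo_name!r} at or above {start}")


def add_to_sys_path(path: Path) -> None:
    """Append `path` to sys.path only if it is not already there."""
    path_str = str(path)
    if path_str not in sys.path:
        sys.path.append(path_str)


# Demo on a made-up location; in the notebook you would pass Path.cwd() instead.
demo_cwd = Path("/home/user/fortunato-wheels-engine/notebooks/examples")
repo_root = find_repo_root(demo_cwd, "fortunato-wheels-engine")
add_to_sys_path(repo_root)
print(repo_root)
```

Because `add_to_sys_path` checks before appending, re-running the cell does not pile up duplicate entries in `sys.path`.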


### What Makes/Models do we have data for?

You can look up the models available for a given make by name. If you don't specify a make, the method returns every make available.

In [25]:
# create a CarAds object which handles all data operations
ads = CarAds()

# check all the makes available
ads.find_make_model_names()

Makes
"Jeep, Land Rover, Subaru, Mazda, Alfa Romeo, BMW, Hyundai, Chevrolet, Lexus, Cadillac, Chrysler, Dodge, Mercedes-Benz, Nissan, Honda, Kia, Ford, Lincoln, Audi, Jaguar, Volkswagen, RAM, Porsche, Toyota, INFINITI, GMC, Acura, Maserati, FIAT, Volvo, Mitsubishi, Buick, Mercury, Scion, Saab, MINI, Ferrari, Genesis, Saturn, Bentley, Suzuki, Tesla, Fisker, Pontiac, Lamborghini, smart, Hummer, Rolls-Royce, Lotus, Spyker, McLaren, Aston Martin, Kaiser, Oldsmobile, Maybach, Freightliner, Karma, Isuzu, Plymouth, Shelby, Triumph, MG, Pagani, Datsun, Studebaker, AM General, Austin-Healey, AMC, Hudson, Willys, Pininfarina, Sunbeam, Geo, Opel, SRT, Edsel, VPG, Eagle, Bugatti, Daewoo, Hillman, Austin, Morris, Packard, Humber, DeTomaso, International Harvester, Ariel, DeSoto, Allard, Bricklin, DeLorean, Nash, Clenet, Mobility Ventures, Franklin, Jensen, Saleen, Koenigsegg, Rover, Infiniti, Other, Fiat, Smart, Mercedes-AMG, Polestar, Daihatsu, Austin Healey, Peugeot, Renault"


In [26]:
# pass a make to find_make_model_names to list the models available for it
ads.find_make_model_names(make="Subaru")

| | Subaru |
|---|---|
| cargurus | WRX STI, Impreza, Outback, WRX, XV Crosstrek, Forester, Crosstrek, Legacy, BRZ, Impreza WRX, Impreza WRX STI, Ascent, Crosstrek Hybrid, B9 Tribeca, Tribeca, Baja, XV Crosstrek Hybrid, SVX |
| kijiji | Impreza, Outback, Forester, Legacy, WRX, XV Crosstrek, BRZ, Impreza WRX STi, Ascent, B9 Tribeca, Tribeca, Other, Baja, Solterra, SVX |
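
The two sources do not list exactly the same model names, and some differ only in capitalization (`Impreza WRX STI` vs. `Impreza WRX STi`). The sketch below shows one way to compare them with plain Python sets. The lists are copied by hand from the output above rather than read from `CarAds`, and the `normalize` helper is a hypothetical name introduced for illustration.

```python
# Model names copied from the Subaru output above.
cargurus_models = [
    "WRX STI", "Impreza", "Outback", "WRX", "XV Crosstrek", "Forester",
    "Crosstrek", "Legacy", "BRZ", "Impreza WRX", "Impreza WRX STI", "Ascent",
    "Crosstrek Hybrid", "B9 Tribeca", "Tribeca", "Baja", "XV Crosstrek Hybrid",
    "SVX",
]
kijiji_models = [
    "Impreza", "Outback", "Forester", "Legacy", "WRX", "XV Crosstrek", "BRZ",
    "Impreza WRX STi", "Ascent", "B9 Tribeca", "Tribeca", "Other", "Baja",
    "Solterra", "SVX",
]


def normalize(names):
    """Strip and case-fold names so spellings that differ only in case match."""
    return {name.strip().casefold() for name in names}


cargurus_set = normalize(cargurus_models)
kijiji_set = normalize(kijiji_models)

only_cargurus = sorted(cargurus_set - kijiji_set)  # models missing from kijiji
only_kijiji = sorted(kijiji_set - cargurus_set)    # models missing from cargurus
in_both = sorted(cargurus_set & kijiji_set)        # models present in both

print("Only on cargurus:", only_cargurus)
print("Only on kijiji:", only_kijiji)
print("Models in both sources:", len(in_both))
```

A check like this is useful before filtering by model, since a model name that exists in only one source will silently return ads from that source alone.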


### I want to get all ads for a specific make and model

You can get all ads for a specific make and model across the available datasets. Each argument is optional: if you leave one out, the method returns all ads that match the criteria you did specify.

In [4]:
make = "Subaru"
model = "Outback"
year_range = (2008, 2012)

ads.get_car_ads(year_range=year_range, make=make, model=model)

2023-04-06 00:36:25,843 - src.data.car_ads - INFO - Getting all car ads from all sources...
2023-04-06 00:36:25,846 - src.data.car_ads - INFO - Getting all cargurus car ads...
2023-04-06 00:36:27,604 - src.data.car_ads - INFO - Found 1318 cargurus car ads.
2023-04-06 00:36:27,800 - src.data.car_ads - INFO - Getting all kijiji car ads...
2023-04-06 00:36:29,346 - src.data.car_ads - INFO - Found 79 kijiji car ads.
2023-04-06 00:36:29,357 - src.data.car_ads - INFO - Found 1397 car ads.
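
The internals of `get_car_ads` are not shown in this notebook. To illustrate the "every argument is optional" behavior described above, here is a minimal pandas sketch on a made-up DataFrame. The function `filter_ads` is hypothetical, and treating `year_range` as inclusive on both ends is an assumption, not something the notebook confirms.

```python
from typing import Optional, Tuple

import pandas as pd


def filter_ads(
    df: pd.DataFrame,
    make: Optional[str] = None,
    model: Optional[str] = None,
    year_range: Optional[Tuple[int, int]] = None,
) -> pd.DataFrame:
    """Keep rows that match every criterion that was given; ignore the rest."""
    mask = pd.Series(True, index=df.index)
    if make is not None:
        mask &= df["make"] == make
    if model is not None:
        mask &= df["model"] == model
    if year_range is not None:
        # Assumption: both ends of the range are inclusive.
        mask &= df["year"].between(year_range[0], year_range[1])
    return df[mask].reset_index(drop=True)


# Made-up rows, only for demonstration.
toy_ads = pd.DataFrame(
    {
        "make": ["Subaru", "Subaru", "Subaru", "Mazda"],
        "model": ["Outback", "Outback", "Forester", "CX-5"],
        "year": [2007, 2010, 2011, 2010],
    }
)

subset = filter_ads(toy_ads, make="Subaru", model="Outback", year_range=(2008, 2012))
print(subset)
```

Calling `filter_ads(toy_ads)` with no criteria returns every row, which mirrors the behavior described for omitted arguments.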


In [5]:
ads.df.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1397 entries, 0 to 1396
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   make                      1397 non-null   object        
 1   model                     1397 non-null   object        
 2   year                      1397 non-null   int64         
 3   listed_date               1318 non-null   datetime64[ns]
 4   price                     1394 non-null   float64       
 5   mileage                   1391 non-null   float64       
 6   major_options             1304 non-null   object        
 7   seller_rating             1295 non-null   float64       
 8   horsepower                1318 non-null   float64       
 9   fuel_type                 1318 non-null   category      
 10  wheel_system              1313 non-null   category      
 11  currency                  1394 non-null   object        
 12  exchange_rate_usd_to
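
The non-null counts above vary by column. For example, `listed_date` has 1318 non-null values, the same as the number of cargurus ads found, which suggests (but does not prove) that some columns are only populated by one source. A minimal sketch for checking this is shown below. It uses a small made-up DataFrame rather than `ads.df`, and assumes a `source` column like the one used for coloring the plot.

```python
import pandas as pd

# Made-up rows, only for demonstration.
toy_ads = pd.DataFrame(
    {
        "source": ["cargurus", "cargurus", "kijiji", "kijiji"],
        "price": [12500.0, None, 9800.0, 11000.0],
        "listed_date": pd.to_datetime(["2020-08-01", "2020-09-15", None, None]),
    }
)

# Share of missing values per column, split by source
# (0.0 means fully populated, 1.0 means entirely missing).
missing_share = (
    toy_ads.drop(columns="source").isna().groupby(toy_ads["source"]).mean()
)
print(missing_share)
```

Running the same expression on `ads.df` should show at a glance which columns are safe to use when comparing ads across sources.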

In [6]:
# plot the distribution of prices
fig = px.histogram(
    ads.df.query("price < 200_000"),
    x="price",
    color="source",
    title=f"Distribution of Prices for {make} {model} {year_range[0]}-{year_range[1]}",
).update_layout(
    xaxis_title="Price (CAD)",
    yaxis_title="No. of Ads",
)
fig.show()