## Data Loading
Let's load the data and look at what it contains and how we can use it.

In [4]:
# Install the required packages
!pip install -r requirements.txt



In [5]:
## Imports
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

Let's load the data from the CSV file generated by the scraper.

In [6]:
df = pd.read_csv('./get_data/crawl_data.csv')
brands = pd.DataFrame(df.brand.unique()).sort_values(0).reset_index(drop=True)
display(brands)
print('Nº of different engines %s' % len(df.engine.unique()))
df.info()


Unnamed: 0,0
0,Audi
1,BMW
2,BYD
3,Brilliance
4,Chana
5,Changan
6,Changhe
7,Chery
8,Chevrolet
9,Chrysler


Nº of different engines 296
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2016 entries, 0 to 2015
Data columns (total 16 columns):
brand           2016 non-null object
color           2016 non-null object
engine          1757 non-null object
fuel            1781 non-null object
id              2016 non-null object
image_urls      2016 non-null object
images          2016 non-null object
location        2016 non-null object
mileage         2016 non-null int64
model           2016 non-null object
onlyOwner       1309 non-null object
price           2016 non-null object
steering        2016 non-null object
traction        1365 non-null object
transmission    2016 non-null object
year            2016 non-null int64
dtypes: int64(2), object(14)
memory usage: 252.1+ KB


There are many different engine descriptions, and they come in inconsistent formats: some are in liters, others in cubic centimeters (c.c.), and some give only the engine's name with no displacement at all.

Because of this, there is no straightforward way to make this information reliable, so I chose to drop the column. The image columns (`image_urls` and `images`) are dropped in the same step.
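
To make the formatting problem concrete, here is a minimal sketch of what a parser for the engine strings might look like. The sample strings are invented to mirror the formats described above (they are not taken from `crawl_data.csv`), and `engine_to_liters` is a hypothetical helper, not part of this notebook.

```python
import re

# Hypothetical engine strings, invented to mirror the formats described above
samples = ["2.0", "1600 cc", "2.0L Turbo", "1.6 litros", "V6", "Duratec"]

# A number followed by a liter marker, e.g. "2.0L" or "1.6 litros"
LITERS = re.compile(r"(\d+(?:[.,]\d+)?)\s*(?:l|lts?|litros?)\b", re.IGNORECASE)
# A three or four digit number followed by "cc" or "c.c."
CC = re.compile(r"(\d{3,4})\s*c\.?c\.?", re.IGNORECASE)

def engine_to_liters(text):
    """Return the displacement in liters, or None when no unit can be recognised."""
    match = CC.search(text)
    if match:
        return int(match.group(1)) / 1000
    match = LITERS.search(text)
    if match:
        return float(match.group(1).replace(",", "."))
    return None

parsed = {sample: engine_to_liters(sample) for sample in samples}
for sample, liters in parsed.items():
    print(repr(sample), "->", liters)
```

Even on this small invented sample, entries such as a bare number, `V6`, or an engine name stay unresolved, which is the kind of gap that motivates dropping the column instead of trying to repair it.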

In [7]:
df = df.drop(columns=['engine', 'image_urls', 'images'])
display(df.head())

Unnamed: 0,brand,color,fuel,id,location,mileage,model,onlyOwner,price,steering,traction,transmission,year
0,Mercedes Benz,Azul,,MCO472077653,El vehículo está en Belmira - Usaquén - Bogotá...,35000,Clase C,,46.950.000,Hidráulica,,Automática,2011
1,Volkswagen,Plateado,Gasolina,MCO474298545,El vehículo está en Bogota - Bogotá D.c.,67700,Jetta,No,31.900.000,Hidráulica,4x2,Mecánica,2014
2,Toyota,Beige,Gasolina,MCO471864641,El vehículo está en Castropol - Medellín - Ant...,97000,Fortuner,,66.900.000,Hidráulica,4x2,Automática,2013
3,Nissan,Blanco,Gasolina,MCO472924395,El vehículo está en Bogota - Bogotá D.c.,135484,Murano,No,35.000.000,Hidráulica,4x4,Automática,2004
4,Renault,Blanco,,MCO471930027,El vehículo está en Los Alpes - Pereira - Risa...,4200,Twizi,Sí,29.990.000,Mecánica,,Automática,2016


The next variable to deal with is the location. Each value starts with *El vehículo está en* ("The vehicle is in"), which needs to be removed, and the place names are separated by hyphens, which also need to be removed.

In [8]:
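def clean_location(locations, starts_with='El vehículo está en '):
    # Remove the fixed prefix as a literal string (no regex interpretation)
    location_only_places = locations.str.replace(starts_with, '', regex=False)
    locations_split = location_only_places.str.split('-')
    # Reverse so the broadest area comes first, and strip the spaces left around each name
    location_reverse = locations_split.apply(lambda parts: [p.strip() for p in reversed(parts)])
    return location_reverse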

def get_state_city(location):
    return pd.DataFrame(location.values.tolist())[[0,1]]
    
loc_temp = clean_location(df.location)
df_locations = get_state_city(loc_temp)
df_locations.columns = ['State', 'City']
df = pd.concat([df,df_locations], axis=1)
df = df.drop(columns=['location'])
df = df.set_index('id')
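
As a worked example of the cleaning steps above, the same logic can be sketched on plain strings, without pandas. `split_location` is a hypothetical helper written only for this illustration. The first sample string appears in the data preview above; the second is invented to show a three-part location.

```python
PREFIX = "El vehículo está en "

def split_location(raw, prefix=PREFIX):
    """Strip the fixed prefix, split on hyphens, and return the parts broadest first."""
    places = raw.replace(prefix, "")
    parts = [part.strip() for part in places.split("-")]
    return list(reversed(parts))

samples = [
    "El vehículo está en Bogota - Bogotá D.c.",          # from the preview above
    "El vehículo está en Centro - Medellín - Antioquia",  # invented three-part example
]

for raw in samples:
    state, city = split_location(raw)[:2]
    print(state, "|", city)
```

Reversing the parts puts the state first and the city second regardless of whether a neighbourhood is present, which is why only the first two elements are kept. One caveat of splitting on a bare hyphen is that a place name that itself contains a hyphen would be split as well.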

Another thing to be aware of is that car manufacturers build different cars for different *sectors*. In the early days of car manufacturing, a manufacturer had one car and that was it, but as the market grew and people's needs changed, so did the cars. Nowadays a manufacturer tries to produce at least one car per **segment**. This matters because it makes no sense to compare a truck with a passenger car. That is why we are going to create a new variable from the data we have so far, pulling what is missing from the internet.

In [49]:
# Import extra libraries
import requests
from bs4 import BeautifulSoup
import urllib
from fuzzywuzzy import fuzz
import unidecode



# Get the brand and model combination so we can access the car database
brand_model = df[['brand', 'model', 'year']]

# Table of scraped specs; brand, model and year are kept as regular columns
car_info = pd.DataFrame(columns=['brand', 'model', 'year', 'doors', 'type'])

def select_model(option_models):
    print('-1 - None of the above')
    for index, option in enumerate(option_models):
        print('{} - {}'.format(index, option['name']))
    selection = int(input('Select a number'))
    if selection == -1:
        return None
    return option_models[selection]

def data_exist(brand, model, year):
    # Look the combination up in the regular columns of car_info
    match = car_info[(car_info['brand'] == brand) &
                     (car_info['model'] == model) &
                     (car_info['year'] == year)]
    if match.empty:
        print('DATA EXIST: {}'.format('NO KEY'))
        return False
    print('DATA EXIST : {}'.format(match))
    return True

def fetch_data(brand, model, year):
    url_database = 'https://www.cars-data.com/ajax_files/'
    brand_simplified = unidecode.unidecode(brand.lower().replace(' ', '-'))
    url_brand = 'get_groups.php?url={brand}'.format(brand=brand_simplified)
    brand_req = requests.get(url_database + url_brand)
    soup = BeautifulSoup(brand_req.content, "lxml")
    options_models = list(map(
        lambda x: {'value':x['value'],'name': x.text},
        soup.findAll('option')[1:] # The first element is the text "Select model (e.g. Focus)"
    ))
    similarities_models = list(map(
        lambda x: fuzz.partial_ratio(model, x['name']),
        options_models
    ))

    if len(options_models) == 0:
        print('No information about the brand {}'.format(brand_simplified))
        return None

    most_likely = np.argsort(similarities_models)[-1]
    model_chosen = options_models[most_likely]['value']
    if similarities_models[most_likely] < 90: # Without a close match the model name may differ, so ask the user
        model_selected = select_model(options_models)
        print(model_selected)
        if model_selected is None:
            print('Nothing got selected')
            return None
        model_chosen = model_selected['value']
    url_years = 'get_years.php?url={}'.format(model_chosen)
    year_req = requests.get(url_database + url_years)
    year_soup = BeautifulSoup(year_req.content, 'lxml')
    options_year = list(map(
        lambda x: {'value': x['value'], 'name': x.text},
        year_soup.findAll('option')
    ))
    if len(options_year) == 0:
        print('No years listed for the model {}'.format(model_chosen))
        return None
    year_index = 0
    # Advance to the first listed year that is not earlier than the requested one,
    # stopping at the last option instead of running past the end of the list
    while (year_index < len(options_year) - 1 and
           int(year) > int(options_year[year_index]['name'])):
        year_index += 1

    year_chosen = options_year[year_index]['value']

    car_uri = year_chosen.split('|')[1]
    car_url = 'https://www.cars-data.com/en/{}'.format(car_uri)

    car_req = requests.get(car_url)
    car_soup = BeautifulSoup(car_req.content, 'lxml')
    section_p = car_soup.find('section', {'class': 'title'}).p.text
    car_description = section_p.split(',')[1]
    car_spec = car_description.split()[:2]
    df_new_data = pd.DataFrame({'brand': [brand],
                                'year': [year],
                                'model': [model],
                                'doors':  [car_spec[0]],
                                'type': [car_spec[1]]})
    return df_new_data

def get_data(brand, model, year):
    # Always return a table, so the caller can safely reassign car_info
    if data_exist(brand, model, year):
        return car_info
    new_data = fetch_data(brand, model, year)
    if new_data is None:
        return car_info
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    return pd.concat([car_info, new_data], ignore_index=True)

for index, row in df.iterrows():
    brand, model, year = row[['brand', 'model', 'year']]
    print('Calling get data with {}, {}, {}'.format(brand, model, year))
    car_info = get_data(brand, model, year)
    print(car_info)
    
    

Calling get data with Mercedes Benz, Clase C, 2011
DATA EXIST: NO KEY
-1 - None of the above
0 - 190-serie Sedan
1 - 200-serie Sedan
2 - 200-serie Cabrio Convertible
3 - 200-serie Combi Wagon
4 - 200-serie Coupe Coupe
5 - A-class Hatchback
6 - A-class Mpv
7 - A-class Coupe Mpv
8 - A-class Lang Mpv
9 - AMG GT Coupe
10 - AMG GT Roadster Convertible
11 - B-class Mpv
12 - B-class Electric Drive Mpv
13 - C-class Sedan
14 - C-class Cabriolet Convertible
15 - C-class Combi Wagon
16 - C-class Coupe Coupe
17 - C-class Estate Wagon
18 - C-class Sportcoupe Coupe
19 - C-class Sports Coupe Coupe
20 - Citan Tourer Mpv
21 - CL-class Coupe
22 - CLA-class Coupe Sedan
23 - CLA-class Shooting Brake Wagon
24 - CLC-class Coupe
25 - CLK Cabriolet Convertible
26 - CLK-class Convertible
27 - CLK-class Coupe
28 - CLK-class Cabriolet Convertible
29 - CLS-class Sedan
30 - CLS-class Shooting Brake Wagon
31 - E-class Sedan
32 - E-class All-Terrain Wagon
33 - E-class Cabriolet Convertible
34 - E-class Combi Wagon
3

IndexError: list index out of range
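
The model-matching step in `fetch_data` can be tried offline with a small sketch. It uses the standard library's `difflib.SequenceMatcher` as a stand-in for `fuzz.partial_ratio`, so the scores will differ from the ones `fuzzywuzzy` produces. `similarity` and `best_match` are hypothetical helpers, and the candidate names are copied from the listing above.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Score two strings from 0 to 100, ignoring case (a stand-in for fuzz.partial_ratio)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

def best_match(query, candidates, threshold=90):
    """Return (name, score) for the best candidate, or (None, score) if it is below the threshold."""
    if not candidates:
        return None, 0
    score, name = max((similarity(query, name), name) for name in candidates)
    if score >= threshold:
        return name, score
    return None, score

# Candidate names copied from the listing above
options = ["A-class Hatchback", "C-class Sedan", "E-class Sedan"]

print(best_match("C-class Sedan", options))  # an exact name clears the threshold
print(best_match("Clase C", options))        # the Spanish listing name does not
```

In this sketch a Spanish model name such as `Clase C` stays below the threshold against the English names, which is consistent with the run above falling back to the manual selection prompt for the Mercedes Benz listing. Returning `None` for a weak match keeps that decision explicit, so the caller can choose between prompting and skipping the row.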