# 2nd submission
**deadline: 11.4.2020**
- brief description of work by other authors = related work, 2 to 3 sources (1-2 scientific)
    - papers working with the same dataset or a similar task (our domain or a similar one)
    - non-scientific source - something from Kaggle, a blog ... something I can cite
    - summary: what they achieved, which data, which algorithms... can we compare ourselves with them?
- preprocessing
- DM methods
    - which methods we will use - regression: linear, NN, DST ...
- initial experiments
    - throwaway prototype
- evaluation
    - work in the next phase
    
### preprocessing and attribute selection (implementing the scenario from the analysis)
- unify data formats, fix erroneous values, outliers (error vs. real value), missing values (mean, median, kNN)
- data transformation (categorical to numeric)
- feature engineering
    - derive new features from existing ones
    - combine existing ones
    - domain knowledge
- feature selection
    - selecting a subset of attributes / features
    - some methods do it themselves: lin. regression, DST, RDF
    - filter (selection before training), wrapper (selection tied to the trained model), embedded
    
- data split - train, valid, test split
- evaluation metrics (regression - RMSE)
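
A minimal sketch of the last two points, a train/valid/test split followed by an RMSE evaluation. The synthetic data, the column names `x1` and `x2`, the 60/20/20 ratios and the `rmse` helper are illustrative assumptions, not choices made in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical, not the cars dataset)
rng = np.random.default_rng(0)
X = pd.DataFrame({'x1': rng.normal(size=200), 'x2': rng.normal(size=200)})
y = 3.0 * X['x1'] - 2.0 * X['x2'] + rng.normal(scale=0.1, size=200)

# Hold out 20% as the test set, then split the rest 75/25 into train/valid
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


# Fit on train only and measure the error on the validation part
model = LinearRegression().fit(X_train, y_train)
valid_rmse = rmse(y_valid, model.predict(X_valid))
```

The test part stays untouched until the final evaluation; model and feature choices would be compared on the validation part only.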

template: https://www.springer.com/gp/computer-science/lncs/conference-proceedings-guidelines?countryChanged=true

In [1]:
import numpy as np
import pandas as pd
import re
from sklearn.model_selection import train_test_split

## Loading the data

In [152]:
df = pd.read_csv('data/cars_raw.csv').drop('Unnamed: 0', axis=1)
df.head(3)

Unnamed: 0,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,New_Price,Price
0,Maruti Wagon R LXI CNG,Mumbai,2010,72000,CNG,Manual,First,26.6 km/kg,998 CC,58.16 bhp,5.0,,1.75
1,Hyundai Creta 1.6 CRDi SX Option,Pune,2015,41000,Diesel,Manual,First,19.67 kmpl,1582 CC,126.2 bhp,5.0,,12.5
2,Honda Jazz V,Chennai,2011,46000,Petrol,Manual,First,18.2 kmpl,1199 CC,88.7 bhp,5.0,8.61 Lakh,4.5


## Data preprocessing
- based on the analysis and the description of errors

### Converting attributes to numeric values - *Power, Engine*

In [153]:
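def convert_str_to_num(column):
    # keep only the leading token (e.g. '58.16 bhp' -> '58.16') and parse it;
    # anything that is not a number becomes NaN
    df[column] = df[column].apply(lambda x: str(x).split()[0])
    df[column] = pd.to_numeric(df[column], errors='coerce')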

In [154]:
convert_str_to_num('Power')
convert_str_to_num('Engine')
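
The same conversion can be sketched without the global `df`, as a function that takes and returns a Series. The name `leading_number` and the sample values (including `'null bhp'`) are hypothetical and only illustrate how non-numeric tokens end up as NaN.

```python
import pandas as pd


def leading_number(series: pd.Series) -> pd.Series:
    """Parse the first whitespace-separated token of each value as a number."""
    first_token = series.astype(str).str.split().str[0]
    # errors='coerce' turns tokens that are not numbers into NaN
    return pd.to_numeric(first_token, errors='coerce')


# Hypothetical sample values in the style of the Power column
sample = pd.Series(['58.16 bhp', '126.2 bhp', 'null bhp', None])
converted = leading_number(sample)
```

Under this assumption the call in the notebook would read `df['Power'] = leading_number(df['Power'])`.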

### Creating a new feature - *Manufacturer*

In [155]:
df['Manufacturer'] = df['Name'].apply(lambda x: str(x).split()[0])
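
For comparison, a sketch of the same extraction with the vectorized string accessor instead of `apply`. The three names are taken from the table above; the variable names are illustrative.

```python
import pandas as pd

names = pd.Series([
    'Maruti Wagon R LXI CNG',
    'Hyundai Creta 1.6 CRDi SX Option',
    'Honda Jazz V',
])

# Split once on whitespace and keep the first token as the manufacturer
manufacturer = names.str.split(n=1).str[0]
```

Taking only the first token assumes one-word manufacturer names; a two-word name would be cut after its first word.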

### Creating a new feature - *Price_km*
- based on the Mileage attribute
- fuel price per 1 kilometer in Indian rupees [ https://www.mypetrolprice.com/2/Fuel-prices-in-Delhi ]
- electric vehicles have no Mileage value, but there are only 2 of them

In [156]:
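def calculate_price_per_km(row):
    # fuel prices in Indian rupees
    fuel_prices = {
        'Petrol': 69.63,
        'Diesel': 62.33,
        'CNG': 42,
        'LPG': 744,
    }
    # fuel types without a listed price (e.g. Electric) get a missing value
    price = fuel_prices.get(row['Fuel_Type'])
    if price is None:
        return np.nan
    # NOTE: Mileage is km per unit of fuel, so a true cost per km would be
    # price / Mileage; this product is what the Price_km values in this notebook contain
    return price * row['Mileage']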

In [157]:
convert_str_to_num('Mileage')

df['Price_km'] = df[['Fuel_Type', 'Mileage']].apply(calculate_price_per_km, axis = 1)
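
The function above multiplies the unit price by Mileage. Since Mileage is given in km per unit of fuel (kmpl or km/kg), a cost per kilometer would divide instead. A minimal sketch of that variant follows; the names `FUEL_PRICES`, `cost_per_km` and `Cost_per_km` are hypothetical, the prices are copied from the cell above, and it is an assumption that each price refers to the same unit of fuel as the Mileage value.

```python
import numpy as np
import pandas as pd

# Unit prices copied from the cell above (Indian rupees)
FUEL_PRICES = {'Petrol': 69.63, 'Diesel': 62.33, 'CNG': 42, 'LPG': 744}


def cost_per_km(frame: pd.DataFrame) -> pd.Series:
    """Fuel cost per kilometer = unit price / distance per unit of fuel."""
    # Fuel types missing from the mapping (e.g. Electric) become NaN
    unit_price = frame['Fuel_Type'].map(FUEL_PRICES)
    # A Mileage of 0 is treated as missing to avoid division by zero
    mileage = frame['Mileage'].replace(0, np.nan)
    return unit_price / mileage


sample = pd.DataFrame({
    'Fuel_Type': ['CNG', 'Diesel', 'Petrol', 'Electric'],
    'Mileage': [26.6, 19.67, 18.2, np.nan],
})
sample['Cost_per_km'] = cost_per_km(sample)
```

With this variant a more economical car gets a lower value, which is the direction the feature name suggests.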

### Removing unnecessary attributes - *New_Price*

In [158]:
df = df.drop(columns=['New_Price'])

## Normalizing values

### Missing values and outliers are replaced with the mode of the same manufacturer
- a seat count of 0 is replaced with the seat count of the same car model; if there is none, with that of the same manufacturer
- extreme kilometer values are replaced with the mean kilometers of cars from the same production year (they have had roughly the same time to accumulate distance)

#### 0 seats

In [163]:
df.loc[df['Seats'] == 0.0]

Unnamed: 0,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,Price,Manufacturer,Price_km


In [160]:
# other cars with the same name
df[(df['Name']=='Audi A4 3.2 FSI Tiptronic Quattro') & (df.index != 3999)]

Unnamed: 0,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,Price,Manufacturer,Price_km


In [161]:
# other cars of the same manufacturer
df[df['Manufacturer']=='Audi']['Seats'].value_counts()

5.0    192
7.0     38
4.0      3
2.0      2
0.0      1
Name: Seats, dtype: int64

In [162]:
value = df[df['Manufacturer']=='Audi']['Seats'].mode()[0]
empty_index = df.loc[df['Seats'] == 0.0].index[0]
df.at[empty_index, 'Seats'] = value
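
The rule described above (same car name first, then same manufacturer) can also be sketched for any number of affected rows. The helper names `first_mode` and `fill_zero_seats` and the sample rows are hypothetical. Unlike the cell above, this sketch treats 0 as missing before computing the mode.

```python
import numpy as np
import pandas as pd


def first_mode(values: pd.Series):
    """Most frequent non-missing value, or NaN if the group has none."""
    modes = values.mode()  # mode() ignores NaN by default
    return modes.iloc[0] if not modes.empty else np.nan


def fill_zero_seats(frame: pd.DataFrame) -> pd.DataFrame:
    out = frame.copy()
    seats = out['Seats'].replace(0, np.nan)  # treat 0 seats as missing
    # Mode per car name, then per manufacturer, aligned to the original rows
    by_name = seats.groupby(out['Name']).transform(first_mode)
    by_brand = seats.groupby(out['Manufacturer']).transform(first_mode)
    out['Seats'] = seats.fillna(by_name).fillna(by_brand)
    return out


# Hypothetical rows: one car with no namesake, one with a namesake
sample = pd.DataFrame({
    'Name': ['Audi A4 X', 'Audi A4 Y', 'Audi Q7', 'Audi A4 Z',
             'Honda Jazz V', 'Honda Jazz V'],
    'Manufacturer': ['Audi', 'Audi', 'Audi', 'Audi', 'Honda', 'Honda'],
    'Seats': [5.0, 5.0, 7.0, 0.0, 5.0, 0.0],
})
fixed = fill_zero_seats(sample)
```

Here the Audi without a namesake falls back to the manufacturer mode, while the Honda takes the value from the car with the same name.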

#### Extreme number of kilometers

In [168]:
df[df['Kilometers_Driven']>1000000]

Unnamed: 0,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,Price,Manufacturer,Price_km


In [165]:
# mean kilometers driven for cars with the same production year
df[df['Year']==2017]['Kilometers_Driven'].mean()

44728.38841567291

In [166]:
# for comparison, the overall mean of kilometers driven
df['Kilometers_Driven'].mean()

58738.38029573019

In [167]:
value = df[df['Year']==2017]['Kilometers_Driven'].mean()
empty_index = df.loc[df['Kilometers_Driven'] > 1000000].index[0]
df.at[empty_index, 'Kilometers_Driven'] = value
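
A sketch of the same replacement without the hard-coded year, where the extreme value is excluded from the mean it is replaced with. The names `KM_LIMIT` and `replace_extreme_km` and the sample rows are hypothetical, rounding to whole kilometers is an assumption, and the threshold is the one used in the cells above.

```python
import pandas as pd

KM_LIMIT = 1_000_000  # same threshold as in the cells above


def replace_extreme_km(frame: pd.DataFrame, limit: int = KM_LIMIT) -> pd.DataFrame:
    out = frame.copy()
    km = out['Kilometers_Driven'].astype(float)
    # Extreme values become NaN, so they do not distort the yearly mean
    valid_km = km.where(km <= limit)
    year_mean = valid_km.groupby(out['Year']).transform('mean')
    # Fall back to the overall mean if a year has no valid value at all
    filled = valid_km.fillna(year_mean).fillna(valid_km.mean())
    out['Kilometers_Driven'] = filled.round().astype('int64')
    return out


# Hypothetical rows with one extreme value in 2017
sample = pd.DataFrame({
    'Year': [2017, 2017, 2017, 2015],
    'Kilometers_Driven': [30000, 50000, 6500000, 80000],
})
fixed = replace_extreme_km(sample)
```

Values that were already missing would be filled the same way, which may or may not be wanted.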

## Train-test split

In [169]:
X_train, X_test, y_train, y_test = train_test_split(df.loc[:, df.columns != 'Price'], df['Price'], test_size=0.2, random_state=42)

## Removing outliers
- for Kilometers_Driven, Engine, Power
- according to the analysis, this probably will not be worthwhile

In [14]:
def quantile_replace(column):
    # clip values to the 1st and 99th percentile; the bounds come from the
    # training set only and are then applied to both train and test
    q01 = X_train[column].quantile(0.01)
    q99 = X_train[column].quantile(0.99)
    X_train[column] = X_train[column].clip(lower=q01, upper=q99)
    X_test[column] = X_test[column].clip(lower=q01, upper=q99)

In [15]:
# quantile_replace('Kilometers_Driven')

## Saving and loading the corrected data

In [171]:
# save
df.to_csv('data/cars_processed.csv')

In [172]:
# load and print
df = pd.read_csv('data/cars_processed.csv').drop('Unnamed: 0', axis=1)
df.head(3)

Unnamed: 0,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,Price,Manufacturer,Price_km
0,Maruti Wagon R LXI CNG,Mumbai,2010,72000,CNG,Manual,First,26.6,998.0,58.16,5.0,1.75,Maruti,1117.2
1,Hyundai Creta 1.6 CRDi SX Option,Pune,2015,41000,Diesel,Manual,First,19.67,1582.0,126.2,5.0,12.5,Hyundai,1226.0311
2,Honda Jazz V,Chennai,2011,46000,Petrol,Manual,First,18.2,1199.0,88.7,5.0,4.5,Honda,1267.266
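
The `Unnamed: 0` column that is dropped after loading comes from the index written by `to_csv`. A minimal sketch of saving without the index, using an in-memory buffer in place of `data/cars_processed.csv`; the tiny frame is a stand-in for the processed data.

```python
import io

import pandas as pd

frame = pd.DataFrame({'Name': ['Honda Jazz V'], 'Price': [4.5]})

buffer = io.StringIO()  # in-memory stand-in for a file on disk
frame.to_csv(buffer, index=False)  # do not write the row index
buffer.seek(0)

# No extra index column appears, so there is nothing to drop
restored = pd.read_csv(buffer)
```

With `index=False` on the save, the `.drop('Unnamed: 0', axis=1)` on the load would no longer be needed for the processed file.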


## Initial experiments