### Columns:

**General Columns**
* url: url of autos
* short_description, description: Description of autos (in English and German) written by users

**Categorical Columns**
* make_model, make, model: Make and model of autos. Ex: Audi A1
* body_type, body: Body type of autos. Ex: van, sedan
* vat: VAT status of the price. Ex: VAT deductible, price negotiable
* registration, first_registration: First registration date and year of autos
* prev_owner, previous_owners: Number of previous owners
* type: new or used
* next_inspection, inspection_new: Information about inspection (inspection date, ...)
* body_color, body_color_original: Color of auto. Ex: black, red
* paint_type: Paint type of auto. Ex: Metallic, Uni/basic
* upholstery: Upholstery information (material, color)
* gearing_type: Type of gearbox. Ex: automatic, manual
* fuel: Fuel type. Ex: diesel, benzine
* co2_emission, emission_class, emission_label: Emission information
* drive_chain: Drive chain. Ex: front, rear, 4WD
* consumption: Fuel consumption of auto in city, country and combined driving (l/100 km)
* country_version
* entertainment_media
* safety_security
* comfort_convenience
* extras

**Quantitative Columns**
* price: Price of autos
* km: Mileage of autos (km)
* hp: Engine power of autos (the raw values are given in kW)
* displacement: Engine displacement of autos (cc)
* warranty: Warranty period (months)
* weight: Weight of auto (kg)
* nr_of_doors: Number of doors
* nr_of_seats: Number of seats
* cylinders: Number of cylinders
* gears: Number of gears



In [2]:
import re
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline
warnings.filterwarnings("ignore")
pd.set_option("display.max_columns",None)
pd.set_option("display.max_rows",None)

In [None]:
df_org = pd.read_json('./data/scout_car.json', lines=True)
df = df_org.copy()

In [4]:
df.columns

Index(['url', 'make_model', 'short_description', 'body_type', 'price', 'vat',
       'km', 'registration', 'prev_owner', 'kW', 'hp', 'Type',
       'Previous Owners', 'Next Inspection', 'Inspection new', 'Warranty',
       'Full Service', 'Non-smoking Vehicle', 'null', 'Make', 'Model',
       'Offer Number', 'First Registration', 'Body Color', 'Paint Type',
       'Body Color Original', 'Upholstery', 'Body', 'Nr. of Doors',
       'Nr. of Seats', 'Model Code', 'Gearing Type', 'Displacement',
       'Cylinders', 'Weight', 'Drive chain', 'Fuel', 'Consumption',
       'CO2 Emission', 'Emission Class', '\nComfort & Convenience\n',
       '\nEntertainment & Media\n', '\nExtras\n', '\nSafety & Security\n',
       'description', 'Emission Label', 'Gears', 'Country version',
       'Electricity consumption', 'Last Service Date', 'Other Fuel Types',
       'Availability', 'Last Timing Belt Service Date', 'Available from'],
      dtype='object')

### URL

In [5]:
df.drop("url",axis=1,inplace=True)

### Make_Model

In [6]:
df["make_model"].value_counts(dropna=False)

Audi A3           3097
Audi A1           2614
Opel Insignia     2598
Opel Astra        2526
Opel Corsa        2219
Renault Clio      1839
Renault Espace     991
Renault Duster      34
Audi A2              1
Name: make_model, dtype: int64

In [7]:
df.drop(["Make","Model"],axis=1,inplace=True)

* The "Make" and "Model" columns repeat the information already held in "make_model", so they were dropped.
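
The notebook does not show the check behind this statement. A minimal sketch of one way to verify it is given below, on a small hypothetical frame. The raw shapes assumed here (a string with surrounding newlines for `Make`, a list with the value at index 1 for `Model`) are assumptions and may not match the actual data.

```python
import pandas as pd

# Hypothetical sample; the real raw shapes of "Make" and "Model" are assumed.
toy = pd.DataFrame({
    "make_model": ["Audi A1", "Opel Astra", "Renault Clio"],
    "Make": ["\nAudi\n", "\nOpel\n", "\nRenault\n"],
    "Model": [["\n", "A1", "\n"], ["\n", "Astra", "\n"], ["\n", "Clio", "\n"]],
})

# Normalise both parts, join them, and compare with make_model row by row.
make = toy["Make"].str.strip()
model = toy["Model"].str[1].str.strip()
combined = make + " " + model
all_equal = bool((combined == toy["make_model"]).all())
print(all_equal)
```

If `all_equal` is `True`, dropping `Make` and `Model` loses no information.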

### Short description 

In [8]:
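# Take the first "digit.digit" token (engine size in liters, e.g. 1.6) and convert it to cc.
sd_disp = df['short_description'].str.findall(r'\d\.\d').str[0].astype(float)
df['sd_disp'] = sd_disp*1000
# Replace the rounded figures 1600 and 1800 with 1598 and 1798 cc.
df['sd_disp'] = df['sd_disp'].replace(1600,1598).replace(1800,1798)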

In [9]:
df.loc[df['sd_disp']<800,'sd_disp'] = np.nan

* The displacement values in the short description column were extracted; values below 800 cc were treated as implausible and set to NaN.
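
To make the extraction step concrete, here is a small sketch on made-up descriptions. The sample strings are hypothetical, and `str.extract` is used in place of the notebook's `findall(...).str[0]` only because it returns the first match directly.

```python
import pandas as pd

# Hypothetical short descriptions; real entries are free text written by users.
descriptions = pd.Series([
    "Sportback 1.4 TDI S-tronic",
    "1.6 CDTI Edition",
    "Limited",          # no engine size mentioned
    None,               # missing description
    "0.5 l bonus pack", # a digit.digit token that is not an engine size
])

# First "digit.digit" token, read as liters and converted to cc.
liters = descriptions.str.extract(r"(\d\.\d)", expand=False).astype(float)
cc = (liters * 1000).round()

# Keep only plausible engine sizes, mirroring the 800 cc cutoff used above.
cc = cc.where(cc >= 800)
print(cc)
```

The last row shows why the cutoff is needed: any decimal number in the text matches the pattern, whether or not it is an engine size.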

### Displacement

In [10]:
# Take the first list element, remove thousands separators, and keep the leading number (cc).
df['Displacement']=df['Displacement'].str[0].str.replace(",","").str.findall(r'\d+').str[0].astype("float")

In [11]:
#df["Displacement"]=df["Displacement"].str[0].str.replace("cc","").str.replace(",","").str.strip()
#df["Displacement"]=pd.to_numeric(df["Displacement"])

In [12]:
def disp(d1,d2):
    if (d1>4000) | (d1<700) | np.isnan(d1):
        if np.isnan(d2):
            return d1
        else:
            return d2
    else:
        return d1

In [13]:
df['Displacement'] = df.apply(lambda x: disp(x['Displacement'],x['sd_disp']),axis=1)

In [14]:
df['Displacement'].isnull().sum()

180

In [15]:
df.drop(["short_description","sd_disp"],axis=1,inplace=True)

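* The Displacement column was cleaned and then compared with the sd_disp (short description displacement) column: where Displacement was missing or outside the 700-4000 cc range, the sd_disp value was used if available. Some null values of the Displacement column were filled this way.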
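
The same rule can be written without a row-wise `apply`. The sketch below is a vectorized equivalent under the same thresholds, run on hypothetical values; `Displacement_filled` is an illustrative column name that the notebook does not use.

```python
import numpy as np
import pandas as pd

# Hypothetical values covering each branch of the rule.
toy = pd.DataFrame({
    "Displacement": [1598.0, 16000.0, np.nan, 2.0, np.nan],
    "sd_disp":      [1600.0, 1598.0, 1400.0, np.nan, np.nan],
})

# A value is implausible when it is missing or outside the 700-4000 cc range.
implausible = (
    toy["Displacement"].isna()
    | (toy["Displacement"] > 4000)
    | (toy["Displacement"] < 700)
)

# Only fall back to sd_disp where it actually has a value.
use_fallback = implausible & toy["sd_disp"].notna()
toy["Displacement_filled"] = toy["Displacement"].mask(use_fallback, toy["sd_disp"])
print(toy)
```

As in the original function, an implausible value with no fallback (the `2.0` row) is kept unchanged rather than set to NaN.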

### Body Type

In [16]:
df["body_type"].value_counts(dropna=False)

Sedans           7903
Station wagon    3553
Compact          3153
Van               783
Other             290
Transporter        88
NaN                60
Off-Road           56
Coupe              25
Convertible         8
Name: body_type, dtype: int64

In [17]:
df.Body.str[1].str.strip().value_counts(dropna=False)

Sedans           7903
Station wagon    3553
Compact          3153
Van               783
Other             290
Transporter        88
NaN                60
Off-Road           56
Coupe              25
Convertible         8
Name: Body, dtype: int64

In [18]:
df.drop("Body",axis=1,inplace=True)

* Since "Body" and "body_type" hold the same values, the "Body" column was dropped.

### km

In [19]:
# Remove thousands separators and keep the leading number.
df['km']=df['km'].str.replace(',' , '').str.findall(r'\d+').str[0].astype(float)

* km column was cleaned and converted to float.
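
A short sketch of this cleaning step on made-up values follows. The raw format assumed here ("56,013 km", with "- km" for unknown mileage) is a guess at what the strings look like and is not confirmed by the notebook.

```python
import pandas as pd

# Hypothetical raw mileage strings.
km_raw = pd.Series(["56,013 km", "80,000 km", "- km", "10 km"])

# Drop thousands separators, pull out the digits, and coerce anything else to NaN.
digits = km_raw.str.replace(",", "", regex=False).str.extract(r"(\d+)", expand=False)
km = pd.to_numeric(digits, errors="coerce")
print(km)
```

Rows without any digits become NaN instead of raising an error, so they can be handled later together with the other missing values.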

### kW-NULL

In [20]:
df.drop(["kW","null"],axis=1,inplace=True)

* Since there is no meaningful data in kW and null columns they were dropped.

### registration- First Registration

In [21]:
df["registration"]=df["registration"].replace("-/-",np.nan)

In [22]:
df["registration"]=pd.to_datetime(df["registration"]).dt.year

In [23]:
df.drop("First Registration",axis=1,inplace=True)

* The year parts of the registration column were extracted and compared with the 'First Registration' column. Since all values matched exactly, the 'First Registration' column was dropped.
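
The comparison itself is not shown in the cells above. One possible way to run it is sketched here on hypothetical rows; the "MM/YYYY" format for `registration` and the list shape for `First Registration` are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; real formats may differ.
toy = pd.DataFrame({
    "registration": ["01/2016", "03/2017", "-/-"],
    "First Registration": [["\n", "2016", "\n"], ["\n", "2017", "\n"], np.nan],
})

# Year from the registration date, with the "-/-" placeholder treated as missing.
reg_year = pd.to_datetime(
    toy["registration"].replace("-/-", np.nan), format="%m/%Y"
).dt.year

# Year from the second list element of First Registration.
first_year = pd.to_numeric(toy["First Registration"].str[1])

# NaN never equals NaN, so rows missing in both columns are counted as matching.
same = (reg_year == first_year) | (reg_year.isna() & first_year.isna())
print(same.all())
```

The explicit NaN handling matters: a plain `==` would report every row with two missing values as a mismatch.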

### Prev_owner-Previous Owners

In [24]:
df['prev_owner']=df['prev_owner'].str.findall(r"\d+").str[0].astype(float)

In [25]:
df['Previous Owners'] = df['Previous Owners'].astype('str').str.findall(r'\d+').str[0].astype(float)

In [26]:
df["prev_owner"].value_counts()

1.0    8294
2.0     778
3.0      17
4.0       2
Name: prev_owner, dtype: int64

In [27]:
def prev_owner_combine(p1,p2):
    # Both values agree: keep either one.
    if p1 == p2:
        return p1
    # p1 is missing: use p2 (which may itself be NaN).
    elif np.isnan(p1):
        return p2
    # Only p2 is missing: use p1.
    elif np.isnan(p2):
        return p1
    # Both are present but different.
    else:
        return 'conflict'

In [28]:
df["prev_owner"]=df.apply(lambda x: prev_owner_combine(x['prev_owner'],x['Previous Owners']), axis=1)

In [29]:
df.drop('Previous Owners',axis=1,inplace=True)

* prev_owner and Previous Owners columns were combined and Previous Owners column was dropped.
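
As an alternative to the row-wise function, the two columns can be merged with `combine_first`. The sketch below uses hypothetical values and, unlike the function above, marks conflicting rows as NaN rather than with the string 'conflict', so the column stays numeric; that substitution is a design choice of this sketch and not the notebook's behavior.

```python
import numpy as np
import pandas as pd

# Hypothetical values: agreement, one side missing, both missing, and a conflict.
toy = pd.DataFrame({
    "prev_owner":      [1.0, np.nan, 2.0, np.nan, 1.0],
    "Previous Owners": [1.0, 2.0, np.nan, np.nan, 3.0],
})

# Rows where both columns have a value and the values disagree.
conflict = (
    toy["prev_owner"].notna()
    & toy["Previous Owners"].notna()
    & (toy["prev_owner"] != toy["Previous Owners"])
)

# Prefer prev_owner, fall back to Previous Owners, and blank out conflicts.
combined = toy["prev_owner"].combine_first(toy["Previous Owners"]).mask(conflict)
print(combined)
print("conflicts:", int(conflict.sum()))
```

Counting `conflict` separately keeps the information that the string marker carried, without mixing strings into a float column.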

### hp

In [30]:
# rstrip(' kW') strips trailing ' ', 'k' and 'W' characters; "-" marks a missing value.
df['hp']=df['hp'].map(lambda x: x.rstrip(' kW')).replace("-",np.nan).astype("float")

* hp column was cleaned and converted to float.
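
One detail of this step is worth illustrating: `str.rstrip(" kW")` removes a set of trailing characters, not the literal suffix " kW". The sketch below shows this on hypothetical strings and contrasts it with extracting the digits directly; the raw format is an assumption.

```python
import pandas as pd

# Hypothetical raw strings; "100 Wk" is included only to show the character-set behavior.
hp_raw = pd.Series(["66 kW", "141 kW", "- kW", "100 Wk"])

# rstrip removes any trailing ' ', 'k' or 'W', in any order.
stripped = hp_raw.map(lambda s: s.rstrip(" kW"))

# Extracting the digits states the intent directly; rows without digits become NaN.
power_kw = pd.to_numeric(hp_raw.str.extract(r"(\d+)", expand=False), errors="coerce")
print(stripped.tolist())
print(power_kw.tolist())
```

For well-formed values such as "66 kW" both approaches give the same number, so the choice mainly affects how unexpected strings are handled.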

### Type

In [31]:
df.Type=df.Type.str[1].str.strip()

* This column states whether the car is new, used, pre-registered, a demonstration car or an employee's car. This part was cleaned and kept.
* This column also includes data about the fuel type. This data was compared with the Fuel column. Since the two match exactly, the fuel data in the Type column was not used.

### Next Inspection

In [32]:
listINS=[]
for i in df["Next Inspection"]:
    if type(i)==float:
        listINS.append(i)
    elif type(i)==list:
        listINS.append(i[0].strip())
    else:
        listINS.append(i.replace("\n",""))

In [33]:
df["Next Inspection"]=listINS

In [34]:
df["Next Inspection"]=pd.to_datetime(df["Next Inspection"])

### Inspection new

In [35]:
def inspection_new(a):
    if type(a)== list:
        return a[0].replace('\n', '')
    elif type(a)== str:
        return a.replace('\n', '')
    else:
        return a

In [36]:
df['Inspection new'] = df['Inspection new'].apply(inspection_new)

* The other parts of this column show fuel consumption. Since the same data appears in the Consumption column below, this part was not extracted.

### Warranty

In [37]:
def clean_alt_list(a):
    if type(a) == list:
        b = re.findall(r'\d+', a[0])
        if len(b)== 0:
            return np.nan
        else:
            return b[0]
    elif type(a) ==str:
        b = re.findall(r'\d+', a)
        if len(b)== 0:
            return np.nan
        else:
            return b[0]
    else:
        return a

In [38]:
df['Warranty'] = df['Warranty'].apply(clean_alt_list)
df = df.rename({'Warranty': 'Warranty(months)'}, axis=1)

* The Warranty column was cleaned by keeping the warranty period in months; the other parts of the column carry no meaningful information.
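
A compact sketch of the same idea is shown below on hypothetical entries. `first_int` is an illustrative helper, not a function from the notebook, and it returns floats where `clean_alt_list` returns digit strings.

```python
import re

import numpy as np
import pandas as pd

def first_int(value):
    """Return the first integer found in a string, or in the first item of a list, else NaN."""
    if isinstance(value, list):
        value = value[0] if value else ""
    if isinstance(value, str):
        match = re.search(r"\d+", value)
        return float(match.group()) if match else np.nan
    return np.nan

# Hypothetical raw entries: a list, a plain string, a list without digits, and a missing value.
warranty_raw = pd.Series([["\n12 months\n", "\n"], "\n24 months\n", ["\n", "\n"], np.nan])
warranty_months = warranty_raw.apply(first_int)
print(warranty_months.tolist())
```

Returning a float in every branch gives the resulting column a numeric dtype without a later conversion step.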

### Full Service, Non-smoking Vehicle, Offer Number 

In [39]:
df.drop('Full Service',axis=1,inplace=True)

In [40]:
df.drop('Non-smoking Vehicle',axis=1,inplace=True)

In [41]:
df.drop('Offer Number',axis=1,inplace=True)

* Since there is no meaningful data in these columns, they were dropped.

### Body Color -Body Color Original

In [42]:
df["Body Color"]=df["Body Color"].str[1]

In [43]:
df["Body Color Original"]=df["Body Color Original"].str[0]

In [44]:
df["Body Color Original"]=df["Body Color Original"].str.strip()

### Paint Type - Nr. of Seats - Gearing Type - Nr. of Doors - Model Code - Drive chain - Cylinders - Weight


In [45]:
df["Paint Type"]=df["Paint Type"].str[0].str.strip()

In [46]:
df["Nr. of Seats"]=df["Nr. of Seats"].str[0].str.strip()
df["Nr. of Seats"]=pd.to_numeric(df["Nr. of Seats"])

In [47]:
df["Gearing Type"]=df["Gearing Type"].str[1].str.strip()

In [48]:
df["Nr. of Doors"]=df["Nr. of Doors"].str[0].str.strip()
df["Nr. of Doors"]=pd.to_numeric(df["Nr. of Doors"])

In [49]:
df["Model Code"]=df["Model Code"].str[0].str.strip()

In [50]:
df["Drive chain"]=df["Drive chain"].str[0].str.strip()

In [51]:
df["Cylinders"]=df["Cylinders"].str[0].str.strip()

In [52]:
df["Cylinders"]=pd.to_numeric(df["Cylinders"])

In [53]:
df['Weight'] = df['Weight'].str[0].str.replace(',','').str.findall(r'\d+').str[0].astype('float')

### Upholstery

In [54]:
df['upholstery_material']=df['Upholstery'].str[0].str.replace('\n','').str.split(', ').str[0]

In [55]:
list_color = ['Black','Grey','Brown','Beige', 'Blue', 'White']
for i in list_color:
    df['upholstery_material'] = df['upholstery_material'].replace(i,np.nan)

In [56]:
df['upholstery_color'] = df['Upholstery'].str[0].str.replace('\n','').str.replace(', ','')

list_uph_mat = ['Cloth', 'Part leather', 'Full leather', 'Other', 'Velour', 'alcantara']
for i in list_uph_mat:
    df['upholstery_color'] = df['upholstery_color'].str.replace(i,'')
df['upholstery_color'] = df['upholstery_color'].replace('',np.nan)

### Fuel

In [57]:
df["Fuel"]=df["Fuel"].str[1]

In [58]:
def func2(x):
    if 'Diesel' in x:
        return 'Diesel'
    elif 'Super' in x:
        return 'Benzine'
    elif 'Gasoline' in x:
        return 'Benzine'
    elif 'Benzine' in x:
        return 'Benzine'
    else:
        return x

In [59]:
df["Fuel"]=df["Fuel"].apply(func2)

In [60]:
# Assign the result back instead of using inplace=True on a selected column.
df['Fuel'] = df['Fuel'].replace(['CNG','LPG', 'Liquid petroleum gas (LPG)','CNG (Particulate Filter)', 'Domestic gas H','Biogas'], 'Gas')

In [61]:
df['Fuel'] = df['Fuel'].replace(['Others (Particulate Filter)','Others','Electric'], 'Others')

### Consumption

In [62]:
def consume_combined(a):
    if type(a) == list:
        if len(a) >3:
            for i in a:
                if 'comb' in i:
                    return i
        else:
            return a[0]            
    
    else:
        return a
    
df['Consumption_combined'] = df['Consumption'].apply(consume_combined)

In [63]:
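def cleaning_consumption(a):
    # Nested lists hold the value in their first element.
    if type(a) == list:
        if len(a) == 0:
            return np.nan
        a = a[0]
    if type(a) == str:
        # Match the whole number, including values of 10 or more such as "10.5".
        b = re.findall(r"\d+\.?\d*", a)
        return b[0] if b else np.nan
    else:
        return a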

In [64]:
def consume_city(a):
    if type(a) == list:
        if len(a) >3:
            for i in a:
                if 'city' in i:
                    return i
        else:
            return a[1]           
    
    else:
        return a
    
df['Consumption_city'] = df['Consumption'].apply(consume_city)

In [65]:
def consume_country(a):
    if type(a)== list:
        if len(a) >3:
            for i in a:
                if 'country' in i:
                    return i
        else:
            return a[2]            
    
    else:
        return a
    
df['Consumption_country'] = df['Consumption'].apply(consume_country)

In [66]:

df['Consumption_combined'] = df['Consumption_combined'].apply(cleaning_consumption).astype('float')
df['Consumption_city'] = df['Consumption_city'].apply(cleaning_consumption).astype('float')
df['Consumption_country'] = df['Consumption_country'].apply(cleaning_consumption).astype('float')

df.drop("Consumption",axis=1,inplace=True)

### CO2 Emission

In [67]:
df['CO2 Emission'] = df['CO2 Emission'].str[0].str.findall(r'\d+').str[0].astype('float')

In [68]:
df.rename(columns={'CO2 Emission' : 'CO2 Emission(g CO2/km)'}, inplace = True)

### Emission Class 

In [69]:
df["Emission Class"]=df["Emission Class"].str[0].str.replace("\n","")

In [70]:
df['Emission Class'] = df['Emission Class'].replace(['Euro 6','Euro 6d-TEMP','Euro 6d', 'Euro 6c'], 'Euro 6')

In [71]:
df["Emission Class"].value_counts(dropna=False)

Euro 6    12173
NaN        3628
Euro 5       78
Euro 4       40
Name: Emission Class, dtype: int64

### Emission Label, description, "Last Timing Belt Service Date","Electricity consumption", "Available from", "Availability", "Other Fuel Types", "Last Service Date" columns

In [72]:
df.drop("Emission Label",axis=1,inplace=True)

In [73]:
df.drop("description",axis=1,inplace=True)

In [74]:
df.drop("Last Timing Belt Service Date",axis=1,inplace=True)

In [75]:
df.drop("Electricity consumption",axis=1,inplace=True)

In [76]:
df.drop("Available from",axis=1,inplace=True)

In [77]:
df.drop("Availability",axis=1,inplace=True)

In [78]:
df.drop("Other Fuel Types",axis=1,inplace=True)

In [79]:
df.drop("Last Service Date",axis=1,inplace=True)

* Since there was no meaningful data in these columns, they were dropped. 
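
The notebook does not say how these columns were judged to hold no meaningful data. One common check, sketched here on a hypothetical frame, is the share of missing values per column; the 0.7 threshold is an arbitrary assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one complete, one sparse, and one empty column.
toy = pd.DataFrame({
    "price": [14500, 9800, 21000, 17300],
    "Last Service Date": [np.nan, np.nan, np.nan, ["\n", "02/2019", "\n"]],
    "Availability": [np.nan, np.nan, np.nan, np.nan],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Columns above the (assumed) threshold are candidates for dropping.
mostly_empty = missing_share[missing_share > 0.7].index.tolist()
print(missing_share)
print(mostly_empty)
```

A high missing share is only one possible criterion; a column can also be complete yet uninformative, for example an identifier such as an offer number.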

### Gears

In [81]:
df["Gears"]=df["Gears"].str[0].str.replace("\n","")#.value_counts(dropna=False)

In [82]:
df["Gears"]=pd.to_numeric(df["Gears"])

In [83]:
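# Fix the 50-gear typo with .loc so the assignment changes df itself and not a temporary copy.
df.loc[df["Gears"]==50.0,"Gears"]=5.0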

### Country version

In [82]:
df["Country version"]=df["Country version"].str[0].str.replace("\n","")

### The clean data was saved to a csv file called "cleaned_scout_car.csv"

In [83]:
df.to_csv("cleaned_scout_car.csv", index=False)