# Data Wrangling

In this notebook, we will learn about data wrangling: how to prepare the raw data for statistical analysis.
 
Problem: Tom wants to sell his used car. What price should he set?
<br>To estimate a reasonable price, we need
- the prices at which used cars were sold, together with their characteristics
- the features of a car that affect its price: brand, horsepower, color, etc.

In [1]:
import numpy as np
import pandas as pd

pd.set_option('display.max_columns', None)

### Importing data (from url)

We import the data from the URL. By default, read_csv treats the first row of the input file as column headers. This file has no header row, so we set header = None

In [2]:
#url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data'
#df = pd.read_csv(url, header = None)

We now assign the column names

In [3]:
#headers = ['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style', \
# 'drive-wheels', 'engine-location', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type', \
# 'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-ratio', 'horsepower', 'peak-rpm', \
#'city-mpg', 'highway-mpg', 'price']

#df.columns = headers

In [4]:
#df.head()

Alternatively, with url and headers defined, we can pass the column names directly to read_csv through the names argument

In [5]:
#df = pd.read_csv(url, names = headers) 
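As a minimal sketch of the difference between these options, the snippet below reads a small in-memory CSV that has no header row. The two-row sample and its three column names are made up for illustration and are not the real dataset.

```python
import io
import pandas as pd

# Hypothetical two-row file without a header row (not the real dataset).
raw = "3,gas,13495\n1,diesel,16500\n"
cols = ["symboling", "fuel-type", "price"]

# Default: the first data row is consumed as the header, so one record is lost.
wrong = pd.read_csv(io.StringIO(raw))

# header=None keeps every row and numbers the columns 0, 1, 2.
unnamed = pd.read_csv(io.StringIO(raw), header=None)

# names=... keeps every row and labels the columns in one step.
named = pd.read_csv(io.StringIO(raw), names=cols)

print(wrong.shape, unnamed.shape, named.shape)
print(list(named.columns))
```

Passing names covers both steps at once, which is why the notebook's second variant does not need a separate df.columns assignment.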

Save the dataframe so that we don't have to download it every time. Setting index = False prevents the row index from being written as an extra first column, which is the default behavior

In [6]:
#df.to_csv('usedCarsData.csv', index = False)

### Importing data (from saved file)

In [7]:
df = pd.read_csv('usedCarsData.csv')
df.shape

(205, 26)

In [8]:
df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,width,height,curb-weight,engine-type,num-of-cylinders,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,four,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,54.3,2824,ohc,five,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


Before moving on to rigorous data analysis, one should first check
 - data types of features: reveals type mismatches and determines which methods can be used (depending on whether the datatype is numeric or string)
 - data distribution: provides a statistical summary

This gives an overview of the dataset and points out potential issues, such as a wrong datatype assigned to a feature

In [9]:
df.dtypes      # columns datatype 

symboling              int64
normalized-losses     object
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                  object
stroke                object
compression-ratio    float64
horsepower            object
peak-rpm              object
city-mpg               int64
highway-mpg            int64
price                 object
dtype: object
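Several numeric-looking columns above are reported as object. A likely reason, assuming placeholder strings such as '?' are present in those columns, is that a single non-numeric entry forces the whole column to be read as text. The sketch below uses a hypothetical three-value series to show how pd.to_numeric with errors='coerce' can locate such entries.

```python
import pandas as pd

# Hypothetical column mixing numeric strings with a '?' placeholder.
s = pd.Series(["3.47", "2.68", "?"], name="bore")
print(s.dtype)                       # a non-numeric dtype, since the values are strings

# errors='coerce' turns anything unparseable into NaN instead of raising.
as_number = pd.to_numeric(s, errors="coerce")
print(as_number.dtype)

# The original entries that could not be parsed as numbers.
blockers = s[as_number.isna()].tolist()
print(blockers)
```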

In [10]:
df.describe()       # provides statistical summary for each column, excluding NaN values    

Unnamed: 0,symboling,wheel-base,length,width,height,curb-weight,engine-size,compression-ratio,city-mpg,highway-mpg
count,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0
mean,0.834146,98.756585,174.049268,65.907805,53.724878,2555.565854,126.907317,10.142537,25.219512,30.75122
std,1.245307,6.021776,12.337289,2.145204,2.443522,520.680204,41.642693,3.97204,6.542142,6.886443
min,-2.0,86.6,141.1,60.3,47.8,1488.0,61.0,7.0,13.0,16.0
25%,0.0,94.5,166.3,64.1,52.0,2145.0,97.0,8.6,19.0,25.0
50%,1.0,97.0,173.2,65.5,54.1,2414.0,120.0,9.0,24.0,30.0
75%,2.0,102.4,183.1,66.9,55.5,2935.0,141.0,9.4,30.0,34.0
max,3.0,120.9,208.1,72.3,59.8,4066.0,326.0,23.0,49.0,54.0


In [11]:
df.describe(include = 'all')    # by default, describe summarizes only the numeric columns; 
                                # to include the other columns as well, we set include = 'all'

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,width,height,curb-weight,engine-type,num-of-cylinders,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
count,205.0,205,205,205,205,205,205,205,205,205.0,205.0,205.0,205.0,205.0,205,205,205.0,205,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205
unique,,52,22,2,2,3,5,3,2,,,,,,7,7,,8,39.0,37.0,,60.0,24.0,,,187
top,,?,toyota,gas,std,four,sedan,fwd,front,,,,,,ohc,four,,mpfi,3.62,3.4,,68.0,5500.0,,,?
freq,,41,32,185,168,114,96,120,202,,,,,,148,159,,94,23.0,20.0,,19.0,37.0,,,4
mean,0.834146,,,,,,,,,98.756585,174.049268,65.907805,53.724878,2555.565854,,,126.907317,,,,10.142537,,,25.219512,30.75122,
std,1.245307,,,,,,,,,6.021776,12.337289,2.145204,2.443522,520.680204,,,41.642693,,,,3.97204,,,6.542142,6.886443,
min,-2.0,,,,,,,,,86.6,141.1,60.3,47.8,1488.0,,,61.0,,,,7.0,,,13.0,16.0,
25%,0.0,,,,,,,,,94.5,166.3,64.1,52.0,2145.0,,,97.0,,,,8.6,,,19.0,25.0,
50%,1.0,,,,,,,,,97.0,173.2,65.5,54.1,2414.0,,,120.0,,,,9.0,,,24.0,30.0,
75%,2.0,,,,,,,,,102.4,183.1,66.9,55.5,2935.0,,,141.0,,,,9.4,,,30.0,34.0,


## Data Wrangling

Data wrangling (also known as data cleaning or data preprocessing) is the process of converting raw data into a proper format for further analysis. 
<br>This process involves:
 - Identify and handle missing values: a missing value case happens when an entry is left empty
 - Data formatting: raw data from different sources can be in different units, formats and conventions (e.g. NY vs. New York)
 - Data normalization: columns can span very different ranges of values. They should be brought into a similar range for meaningful comparisons
 - Data binning: grouping data together
 - Turning categorical values into numeric values

### 1. Missing Values

We can see several question marks in the dataframe; these mark missing values. 
<br>Missing values may hinder our data analysis, so we need to identify and deal with them.
<br>Depending on the situation, there are several options for dealing with missing values:
 - ask the data collection team to provide it if possible
 - drop the missing value, two possibilities:
   - drop the variable (column): should be done only when most of the entries in the column are missing
   - drop that particular data entry (row)
 - replace the missing value:
   - replace by average value of entire variable
   - replace by mode (for categorical values such as 'fuel-type')
   - infer it from the values of other features
   - leave the missing data as missing data: sometimes it is beneficial to just let it remain a missing entry
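The drop and replace options listed above can be sketched on a small toy dataframe. The four rows and their values are hypothetical and chosen only so that each option has a visible effect.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NaN marks a missing entry.
toy = pd.DataFrame({
    "make": ["audi", "bmw", "volvo", "audi"],
    "horsepower": [111.0, np.nan, 154.0, 102.0],
    "num-of-doors": ["two", "four", np.nan, "four"],
    "price": [13495.0, 16500.0, np.nan, 13950.0],
})

drop_rows = toy.dropna(subset=["price"])   # drop rows that lack a price
drop_cols = toy.dropna(axis=1)             # drop every column that has a gap

# Replace by the column mean (numeric) or by the mode (categorical).
mean_fill = toy["horsepower"].fillna(toy["horsepower"].mean())
mode_fill = toy["num-of-doors"].fillna(toy["num-of-doors"].mode()[0])

print(len(drop_rows), list(drop_cols.columns))
print(mean_fill.tolist())
print(mode_fill.tolist())
```

Dropping columns is the most destructive choice here: only the column without any gap survives, which is why it is reserved for columns that are mostly empty.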

Convert '?' to NaN: we replace all missing data ('?') by NaN (np.nan), which is the default missing value marker in NumPy and pandas

In [12]:
df.replace('?', np.nan, inplace = True)
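An alternative, shown here only as a sketch, is to declare the placeholder when reading the file, using the na_values argument of read_csv. The in-memory sample below is hypothetical. A side effect worth noting is that columns whose only non-numeric entry was '?' are then parsed as numeric directly.

```python
import io
import pandas as pd

# Hypothetical sample in which '?' marks a missing entry.
raw = "normalized-losses,make,price\n?,audi,13950\n164,audi,?\n"

# na_values tells read_csv to treat '?' as NaN while parsing.
df_na = pd.read_csv(io.StringIO(raw), na_values="?")

print(df_na.dtypes)
print(df_na.isnull().sum())
```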

In [13]:
df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,width,height,curb-weight,engine-type,num-of-cylinders,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,four,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,54.3,2824,ohc,five,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


Now that the missing values are converted to NaN, we can use pandas' built-in methods to identify them. 
<br>There are two methods to detect missing data, isnull() and notnull():

In [14]:
missing_data = df.isnull()   # returns a boolean dataframe where NaN entries become True 
missing_data.head()          # and present entries become False

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,width,height,curb-weight,engine-type,num-of-cylinders,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


In [15]:
present_data = df.notnull()     # returns a boolean dataframe where NaN entries become False 
present_data.head()             # and present entries become True

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,width,height,curb-weight,engine-type,num-of-cylinders,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,True,False,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True
1,True,False,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True
2,True,False,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True
3,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True
4,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True


We can figure out the number of missing values in each column using a for loop

In [16]:
for column in missing_data.columns.values.tolist():     # a loop over all columns in the dataframe
    print(column)
    print(missing_data[column].value_counts())          # counts the number of True/False in each column
    print("")

symboling
False    205
Name: symboling, dtype: int64

normalized-losses
False    164
True      41
Name: normalized-losses, dtype: int64

make
False    205
Name: make, dtype: int64

fuel-type
False    205
Name: fuel-type, dtype: int64

aspiration
False    205
Name: aspiration, dtype: int64

num-of-doors
False    203
True       2
Name: num-of-doors, dtype: int64

body-style
False    205
Name: body-style, dtype: int64

drive-wheels
False    205
Name: drive-wheels, dtype: int64

engine-location
False    205
Name: engine-location, dtype: int64

wheel-base
False    205
Name: wheel-base, dtype: int64

length
False    205
Name: length, dtype: int64

width
False    205
Name: width, dtype: int64

height
False    205
Name: height, dtype: int64

curb-weight
False    205
Name: curb-weight, dtype: int64

engine-type
False    205
Name: engine-type, dtype: int64

num-of-cylinders
False    205
Name: num-of-cylinders, dtype: int64

engine-size
False    205
Name: engine-size, dtype: int64

fuel-system
Fa
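A more compact way to get the same per-column counts, sketched here on a hypothetical two-column dataframe, is to sum the boolean mask: True counts as 1 and False as 0.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in one column only.
toy = pd.DataFrame({
    "normalized-losses": [np.nan, 164.0, np.nan],
    "make": ["alfa-romero", "audi", "audi"],
})

# Summing the boolean mask counts the NaN entries in each column.
missing_per_column = toy.isnull().sum()

# Show only the columns that actually have missing values.
print(missing_per_column[missing_per_column > 0])
```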

We have some freedom in choosing how to replace missing data, although some methods are more reasonable than others for a given column. 
<br>We apply each method to the columns where it fits best:

For the following five columns, we calculate the mean of the column and replace its missing values by this mean

In [17]:
# the result is assigned back to the column rather than calling replace(..., inplace = True)
# on a column selection, which may not update df in recent pandas versions
avg_nl = df['normalized-losses'].astype('float').mean(axis = 0)
df['normalized-losses'] = df['normalized-losses'].replace(np.nan, avg_nl)

avg_bore = df['bore'].astype('float').mean(axis = 0)
df['bore'] = df['bore'].replace(np.nan, avg_bore)

avg_stroke = df['stroke'].astype('float').mean(axis = 0)
df['stroke'] = df['stroke'].replace(np.nan, avg_stroke)

avg_hp = df['horsepower'].astype('float').mean(axis = 0)
df['horsepower'] = df['horsepower'].replace(np.nan, avg_hp)

avg_peak_rpm = df['peak-rpm'].astype('float').mean(axis = 0)
df['peak-rpm'] = df['peak-rpm'].replace(np.nan, avg_peak_rpm)
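The same mean replacement can be written as a loop over column names. The sketch below runs on a hypothetical two-column dataframe of numeric strings. Unlike the cell above, it also converts each column to float as part of the fill, which is an assumption about what is wanted rather than what the notebook does at this step.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: numeric values stored as strings, NaN for gaps.
toy = pd.DataFrame({
    "bore": ["3.47", np.nan, "3.19"],
    "stroke": ["2.68", "3.40", np.nan],
}, dtype=object)

for col in ["bore", "stroke"]:
    numeric = pd.to_numeric(toy[col])            # strings become floats, NaN stays NaN
    toy[col] = numeric.fillna(numeric.mean())    # mean() ignores NaN by default

print(toy)
print(toy.dtypes)
```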

Now, for the columns where we want to replace the missing values by the mode, we need to know how frequently each value occurs. 
<br>We use the value_counts() method to obtain this information:

In [18]:
df['num-of-doors'].value_counts()   # returns values with respective frequency

four    114
two      89
Name: num-of-doors, dtype: int64

In [19]:
df['num-of-doors'].value_counts().idxmax()  # returns the value with highest frequency

'four'

Since 'four' is the most common value, we replace the missing values in this column by 'four'

In [20]:
df['num-of-doors'] = df['num-of-doors'].replace(np.nan, 'four')    # assign back instead of inplace on a column selection

Now, we drop the rows where the price is missing, because price is the quantity we want to predict. 
<br>A data entry without a price cannot be used to build or check a price prediction, so these rows are not useful

In [21]:
df.dropna(subset = ['price'], axis = 0, inplace = True)    # axis = 0 drops the row, while axis = 1 drops the column
df.reset_index(drop = True, inplace = True)                # reset index, because we dropped rows
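To see why the index is reset, consider this small sketch on a hypothetical three-row dataframe: after dropna the surviving rows keep their old labels, which leaves a gap in the index.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing price.
toy = pd.DataFrame({
    "make": ["audi", "bmw", "volvo"],
    "price": [13950.0, np.nan, 16500.0],
})

kept_raw = toy.dropna(subset=["price"])      # row 1 is removed
print(kept_raw.index.tolist())               # labels still refer to the old positions

kept = kept_raw.reset_index(drop=True)       # drop=True discards the old labels
print(kept.index.tolist())
```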

In [22]:
df.head()
df.shape

(201, 26)

### 2. Data Formatting

Data is usually collected from different places by different people, and may be stored in different formats. 
<br>Data formatting means bringing the data into a common standard of expression that allows meaningful comparisons.

Formatting examples:
 - We want to know how many people are from New York. People write NY, N.Y, New York, etc. We would like to treat all of these as the same value.
 - Fuel consumption in the US is measured in mpg (miles per gallon). For countries that use the metric system, we would like to convert it to L/100 km.

For many reasons (including how the data is imported into Python), a datatype can be incorrectly assigned.
 - For example, a numeric feature is assigned the object type when the data is imported
 - It is important to assign the correct datatype, otherwise numeric operations may fail or the column may be left out of numeric summaries such as describe()

In [23]:
df.dtypes

symboling              int64
normalized-losses     object
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                  object
stroke                object
compression-ratio    float64
horsepower            object
peak-rpm              object
city-mpg               int64
highway-mpg            int64
price                 object
dtype: object

As we can see above, some columns are not of the correct data type. 
<br>Numerical variables should have type 'float' or 'int', and variables with strings such as categories should have type 'object'. 
<br>For example, 'bore' and 'stroke' variables are numerical values that describe the engines, so we should expect them to be of the type 'float' or 'int'. However, they are shown as type 'object'. 
<br>We have to convert data types into a proper format for each column using the "astype()" method.

In [24]:
df[['bore', 'stroke']] = df[['bore', 'stroke']].astype('float')
df[['normalized-losses']] = df[['normalized-losses']].astype('int')
df[['price']] = df[['price']].astype('float')
df[['peak-rpm']] = df[['peak-rpm']].astype('float')

In [25]:
df.dtypes

symboling              int64
normalized-losses      int64
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                 float64
stroke               float64
compression-ratio    float64
horsepower            object
peak-rpm             float64
city-mpg               int64
highway-mpg            int64
price                float64
dtype: object

In [26]:
df['city-L/100km'] = 235/df['city-mpg']     # transforms mpg to L/100kms: this creates a new column
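The constant 235 is a rounded conversion factor. Assuming US gallons (3.785411784 L) and statute miles (1.609344 km), it can be derived as in the sketch below. The function name mpg_to_l_per_100km is introduced here for illustration only.

```python
LITERS_PER_US_GALLON = 3.785411784
KM_PER_MILE = 1.609344

def mpg_to_l_per_100km(mpg):
    """Convert miles per US gallon to liters per 100 km."""
    # liters per km = (liters per gallon) / (km per gallon); then scale to 100 km
    return 100 * LITERS_PER_US_GALLON / (KM_PER_MILE * mpg)

factor = mpg_to_l_per_100km(1.0)   # the numerator that the notebook rounds to 235
print(round(factor, 4))
print(mpg_to_l_per_100km(21), 235 / 21)
```

Rounding the factor to 235 changes the converted values by less than 0.1 percent, which is negligible for this analysis.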

In [27]:
df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,width,height,curb-weight,engine-type,num-of-cylinders,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price,city-L/100km
0,3,122,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111,5000.0,21,27,13495.0,11.190476
1,3,122,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111,5000.0,21,27,16500.0,11.190476
2,1,122,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9.0,154,5000.0,19,26,16500.0,12.368421
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,four,109,mpfi,3.19,3.4,10.0,102,5500.0,24,30,13950.0,9.791667
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,54.3,2824,ohc,five,136,mpfi,3.19,3.4,8.0,115,5500.0,18,22,17450.0,13.055556


In [28]:
df['highway-mpg'] = 235/df['highway-mpg']                                # transforms in the same column
df.rename(columns = {'highway-mpg' : 'highway-L/100km'}, inplace = True) # rename the column

In [29]:
df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,width,height,curb-weight,engine-type,num-of-cylinders,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-L/100km,price,city-L/100km
0,3,122,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111,5000.0,21,8.703704,13495.0,11.190476
1,3,122,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111,5000.0,21,8.703704,16500.0,11.190476
2,1,122,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9.0,154,5000.0,19,9.038462,16500.0,12.368421
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,four,109,mpfi,3.19,3.4,10.0,102,5500.0,24,7.833333,13950.0,9.791667
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,54.3,2824,ohc,five,136,mpfi,3.19,3.4,8.0,115,5500.0,18,10.681818,17450.0,13.055556


### 3. Data Normalization

Feature values in different columns typically span very different ranges. 
<br>For many statistical models, features should be on a similar scale so that they have comparable influence on the model. 
<br>Without normalization, a feature with larger values will have a greater influence on the model (e.g. linear regression) than a feature with smaller values. 
<br>This does not mean that the former feature is more important for making predictions. To avoid this, we normalize the feature values.

Simple feature scaling: divide each value by the maximum value of that feature

In [30]:
df['length'] = df['length']/df['length'].max()
df['width'] = df['width']/df['width'].max()
df['height'] = df['height']/df['height'].max()

Min-max scaling: subtract the minimum value from each value and divide by the range: (x - x_min)/(x_max - x_min). The scaled values lie in the range [0, 1]

In [31]:
df['length'] = (df['length'] - df['length'].min())/(df['length'].max() - df['length'].min())

Z-score scaling (standard score): subtract the mean from each value and divide by the standard deviation: (x - mu)/sigma.
<br> The scaled values are centered on 0 and typically fall in the range [-3, 3]

In [32]:
df['length'] = (df['length']-df['length'].mean())/df['length'].std()

In [33]:
df['length']

0     -0.438315
1     -0.438315
2     -0.243544
3      0.194690
4      0.194690
         ...   
196    1.184775
197    1.184775
198    1.184775
199    1.184775
200    1.184775
Name: length, Length: 201, dtype: float64
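Note that the cells above apply the three scalings to 'length' one after another, so the column ends up holding z-scores, as the negative values in the output suggest. The sketch below uses a hypothetical five-value series and keeps each result in its own variable so the three methods can be compared side by side.

```python
import pandas as pd

# Hypothetical lengths, for illustration only.
length = pd.Series([168.8, 171.2, 176.6, 188.8, 141.1], name="length")

simple = length / length.max()                                      # largest value becomes 1
min_max = (length - length.min()) / (length.max() - length.min())   # values in [0, 1]
z_score = (length - length.mean()) / length.std()                   # mean 0, std 1

# Z-scores are unchanged by an earlier rescaling with a positive factor,
# so standardizing the already scaled column gives the same values.
z_after_simple = (simple - simple.mean()) / simple.std()

print(pd.DataFrame({"simple": simple, "min_max": min_max, "z_score": z_score}))
```

pandas' std() uses the sample standard deviation (ddof=1) by default, so results can differ slightly from tools that divide by n.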

### 4. Binning

Binning is a process of transforming continuous numerical variables into discrete categorical 'bins' for grouped analysis. 
<br>Sometimes, binning can improve accuracy of the predictive models. Grouping data together helps in better understanding of data distribution.

Here, we bin the used cars into three categories depending on their horsepower: high horsepower, medium horsepower, low horsepower.

In [34]:
df['horsepower'] = df['horsepower'].astype(float, copy=True)    # convert data to correct format

We would like three bins of equal width. Three bins need four bin edges, so we ask linspace for four equally spaced values from the minimum to the maximum horsepower:

In [35]:
bins = np.linspace(min(df['horsepower']), max(df['horsepower']), 4)
group_names = ['Low', 'Medium', 'High']                # set group names

The pandas function cut assigns each value of df['horsepower'] to the bin it belongs to

In [36]:
df['horsepower-binned'] = pd.cut(df['horsepower'], bins, labels = group_names, include_lowest = True)

In [37]:
df[['horsepower', 'horsepower-binned']].head(20)

Unnamed: 0,horsepower,horsepower-binned
0,111.0,Low
1,111.0,Low
2,154.0,Medium
3,102.0,Low
4,115.0,Low
5,110.0,Low
6,110.0,Low
7,110.0,Low
8,140.0,Medium
9,101.0,Low


### 5. Categorical Variables

For statistical modeling, we need to represent categorical variables by numeric values, since most models accept only numeric input.
<br>In this used car dataset, fuel-type takes two values: gas or diesel

Categorical values are encoded into a numerical value using a method called 'one-hot encoding':
 - New features (columns) will be created, corresponding to each unique element in the original feature
 - The new feature value will be 1 if it matches with the value in the original column, otherwise zero
 - For example, if the value of fuel-type = gas in that particular row, feature gas = 1, feature diesel = 0 for that particular row
 - The new feature variables are also known as dummy variables or indicator variables

We use the pandas method 'get_dummies' for one-hot encoding

In [38]:
dummy_variable_fuel_type = pd.get_dummies(df['fuel-type'], dtype = int)   # dtype = int gives 0/1; recent pandas versions default to True/False
dummy_variable_fuel_type.head()

Unnamed: 0,diesel,gas
0,0,1
1,0,1
2,0,1
3,0,1
4,0,1


We now concatenate this new feature dataframe to the original dataframe, and drop the fuel-type column

In [39]:
df = pd.concat([df, dummy_variable_fuel_type], axis = 1)
df.drop('fuel-type', axis = 1, inplace = True)

In [40]:
df.head()

Unnamed: 0,symboling,normalized-losses,make,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,width,height,curb-weight,engine-type,num-of-cylinders,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-L/100km,price,city-L/100km,horsepower-binned,diesel,gas
0,3,122,alfa-romero,std,two,convertible,rwd,front,88.6,-0.438315,0.890278,0.816054,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,8.703704,13495.0,11.190476,Low,0,1
1,3,122,alfa-romero,std,two,convertible,rwd,front,88.6,-0.438315,0.890278,0.816054,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,8.703704,16500.0,11.190476,Low,0,1
2,1,122,alfa-romero,std,two,hatchback,rwd,front,94.5,-0.243544,0.909722,0.876254,2823,ohcv,six,152,mpfi,2.68,3.47,9.0,154.0,5000.0,19,9.038462,16500.0,12.368421,Medium,0,1
3,2,164,audi,std,four,sedan,fwd,front,99.8,0.19469,0.919444,0.908027,2337,ohc,four,109,mpfi,3.19,3.4,10.0,102.0,5500.0,24,7.833333,13950.0,9.791667,Low,0,1
4,2,164,audi,std,four,sedan,4wd,front,99.4,0.19469,0.922222,0.908027,2824,ohc,five,136,mpfi,3.19,3.4,8.0,115.0,5500.0,18,10.681818,17450.0,13.055556,Low,0,1


In [41]:
dummy_variable_aspiration = pd.get_dummies(df['aspiration'], dtype = int)   # dtype = int gives 0/1; recent pandas versions default to True/False
dummy_variable_aspiration.rename(columns = {'std':'aspiration-std', 'turbo':'aspiration-turbo'}, inplace = True)
dummy_variable_aspiration.head()

Unnamed: 0,aspiration-std,aspiration-turbo
0,1,0
1,1,0
2,1,0
3,1,0
4,1,0


In [42]:
df = pd.concat([df, dummy_variable_aspiration], axis=1)
df.drop('aspiration', axis = 1, inplace=True)
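The encode, concatenate and drop steps can also be done in a single call by passing the whole dataframe and a columns list to get_dummies. The sketch below uses a hypothetical three-row dataframe. The new columns are named with the original column name as a prefix and an underscore separator, which differs from the names used in the cells above.

```python
import pandas as pd

# Hypothetical toy data with two categorical columns to encode.
toy = pd.DataFrame({
    "make": ["audi", "volvo", "audi"],
    "fuel-type": ["gas", "diesel", "gas"],
    "aspiration": ["std", "turbo", "std"],
})

# Encodes the listed columns, appends the dummies and drops the originals.
encoded = pd.get_dummies(toy, columns=["fuel-type", "aspiration"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

The automatic prefix keeps dummy columns from different features apart, which matters once several categorical features share values such as 'std' or 'other'.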

In [43]:
df.to_csv('usedCarsDataClean.csv', index = False)    # index = False keeps the row index out of the saved file