# Summarizing Automobile Evaluation Data

In this project, we’ll use what we’ve learned about summarizing categorical data to analyze a sample from a popular open-source dataset. The dataset contains information on the cost and physical attributes of several thousand cars. It was originally used to train a classification model that assigned an acceptability category to each car based on these attributes.

The car evaluation dataset comes from the UCI Machine Learning Repository and has been slightly modified for this project. Specifically, one additional field, manufacturer_country, has been simulated for illustrative purposes.

In [17]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('car_eval_dataset.csv')
df.head()

Unnamed: 0,buying_cost,maintenance_cost,doors,capacity,luggage,safety,acceptability,manufacturer_country
0,vhigh,low,4,4,small,med,unacc,China
1,vhigh,med,3,4,small,high,acc,France
2,med,high,3,2,med,high,unacc,United States
3,low,med,4,more,big,low,unacc,United States
4,low,high,2,more,med,high,acc,South Korea


In [18]:
# Percentage of cars from each country in manufacturer_country
manufacturer_country_percent = df.manufacturer_country.value_counts(normalize=True) * 100
print(manufacturer_country_percent)
# Distinct buying_cost levels
print(df.buying_cost.unique())

Japan            22.8
Germany          21.8
South Korea      15.9
United States    13.8
Italy             9.7
France            8.7
China             7.3
Name: manufacturer_country, dtype: float64
['vhigh', 'med', 'low', 'high']
Categories (4, object): ['low' < 'med' < 'high' < 'vhigh']
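
As a small illustration of what normalize=True does in value_counts, the sketch below uses a made-up list of countries (hypothetical toy data, not the project file) and compares raw counts with the normalized shares.

```python
import pandas as pd

# Hypothetical toy data, not taken from car_eval_dataset.csv
countries = pd.Series(['Japan', 'Japan', 'Germany', 'France'])

counts = countries.value_counts()                # raw frequency of each country
shares = countries.value_counts(normalize=True)  # fractions that sum to 1
percent = shares * 100                           # same fractions as percentages

print(counts)
print(percent)
```

Multiplying by 100 only changes the scale, so the ordering of the countries is the same in both tables.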


# Summarizing Buying Costs

In [22]:
# Find the median buying cost: order the levels, then take the median of their integer codes
df['buying_cost'] = pd.Categorical(df['buying_cost'], ['low', 'med', 'high', 'vhigh'], ordered=True)
print(np.median(df.buying_cost.cat.codes))
df.buying_cost.cat.codes.value_counts()

1.0


1    262
0    249
2    245
3    244
dtype: int64

The median code is 1, which corresponds to a medium ('med') buying cost in our automobile evaluation data.
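
To report the median as a category label rather than an integer code, the code can be looked up in the categorical's list of categories. The following is a minimal sketch on a made-up five-row sample (the values are hypothetical, not the project data). It assumes the median is a whole number. With an even number of rows the median of the codes can end in .5, in which case it falls between two levels and int() would silently round it down.

```python
import numpy as np
import pandas as pd

# Hypothetical toy sample standing in for df['buying_cost']
costs = pd.Series(['vhigh', 'low', 'med', 'med', 'high'])
ordered_costs = pd.Categorical(costs, ['low', 'med', 'high', 'vhigh'], ordered=True)

# Median of the integer codes (0 = 'low', 1 = 'med', 2 = 'high', 3 = 'vhigh')
median_code = np.median(ordered_costs.codes)

# Look up the category label that the median code stands for
median_label = ordered_costs.categories[int(median_code)]
print(median_code, median_label)
```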

# Summarizing Luggage Capacity

In [31]:
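print(df.luggage.unique())
# Order the luggage levels from smallest to largest
df['luggage'] = pd.Categorical(df['luggage'], ['small', 'med', 'big'], ordered=True)
# Percentage of cars at each level (by integer code), then the median code
print(df.luggage.cat.codes.value_counts(normalize=True) * 100)
print(np.median(df.luggage.cat.codes))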

['small', 'med', 'big']
Categories (3, object): ['small' < 'med' < 'big']
0    33.9
1    33.3
2    32.8
dtype: float64
1.0


The median code is 1, which corresponds to medium-sized ('med') luggage capacity in our automobile evaluation data.

# Summarizing Number of Doors

In [35]:
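print(df.doors.unique())
# Order the door counts; the values are strings, with '5more' as the top level
df['doors'] = pd.Categorical(df['doors'], ['2', '3', '4', '5more'], ordered=True)
# Percentage of cars at each level (by integer code), then the median code
print(df.doors.cat.codes.value_counts(normalize=True) * 100)
print(np.median(df.doors.cat.codes))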

['4', '3', '2', '5more']
Categories (4, object): ['2' < '3' < '4' < '5more']
2    26.3
1    25.2
3    24.6
0    23.9
dtype: float64
2.0


The median code is 2, which corresponds to 4 doors in our automobile evaluation data.

# Summarizing Passenger Safety

In [37]:
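print(df.safety.unique())
# Order the safety levels from lowest to highest
df['safety'] = pd.Categorical(df['safety'], ['low', 'med', 'high'], ordered=True)
# Percentage of cars at each level (by integer code), then the median code
print(df.safety.cat.codes.value_counts(normalize=True) * 100)
print(np.median(df.safety.cat.codes))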

['med' 'high' 'low']
0    34.2
1    33.7
2    32.1
dtype: float64
1.0


The median code is 1, which corresponds to a medium ('med') safety rating in our automobile evaluation data.
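
Each section above repeats the same three steps: order the levels, tabulate percentages, and take the median of the integer codes. One possible way to bundle them is sketched below. The helper summarize_ordinal and the toy safety list are hypothetical names and data introduced only for illustration; they are not part of the original project. The sketch rounds a half-way median down to the lower level, which is one arbitrary choice among several.

```python
import numpy as np
import pandas as pd

def summarize_ordinal(values, ordered_levels):
    """Return (percent table, median label) for an ordinal column.

    Hypothetical helper for illustration only.
    """
    cat = pd.Categorical(values, ordered_levels, ordered=True)
    # Percentage of observations at each level
    percent = pd.Series(cat).value_counts(normalize=True) * 100
    # pandas assigns code -1 to values outside ordered_levels; drop those
    codes = cat.codes[cat.codes >= 0]
    median_code = np.median(codes)
    # A median ending in .5 falls between two levels; this sketch rounds down
    median_label = cat.categories[int(np.floor(median_code))]
    return percent, median_label

# Made-up sample, not the project data
safety = ['med', 'high', 'low', 'med', 'low']
percent, median_label = summarize_ordinal(safety, ['low', 'med', 'high'])
print(percent)
print(median_label)
```

Returning the label instead of the code avoids having to remember which integer stands for which level.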