# Summarizing Automobile Evaluation Data

In the following project, you'll use what you've learned about summarizing categorical data to analyze a sample from a popular open-source dataset. This dataset contains information on the cost and physical attributes of several thousand cars. Originally, this dataset was used to train a classification model that assigned an acceptability score/category to cars based on these attributes.

The car evaluation dataset has been sourced from the UCI Machine Learning Repository and has been slightly modified for this project. Specifically, one additional field, `manufacturer_country`, has been simulated for illustrative purposes. You can read more about the details, features, and original uses of this dataset on the [UCI data description page](https://archive.ics.uci.edu/ml/datasets/car+evaluation).

## Summarizing Manufacturing Country

1. `manufacturer_country` is a _nominal categorical variable_ that indicates the country of the manufacturer of each car reviewed. Create a table of frequencies of all the cars reviewed by `manufacturer_country`. What is the modal category? Which country appears 4th most frequently? Print out your results.

In [1]:
import pandas as pd
import numpy as np

car_eval = pd.read_csv('car_eval_dataset.csv')
print(car_eval.head())

  buying_cost maintenance_cost doors capacity luggage safety acceptability  \
0       vhigh              low     4        4   small    med         unacc   
1       vhigh              med     3        4   small   high           acc   
2         med             high     3        2     med   high         unacc   
3         low              med     4     more     big    low         unacc   
4         low             high     2     more     med   high           acc   

  manufacturer_country  
0                China  
1               France  
2        United States  
3        United States  
4          South Korea  


In [2]:
# create a table of frequencies
freq_table = car_eval.manufacturer_country.value_counts()
print(freq_table)

Japan            228
Germany          218
South Korea      159
United States    138
Italy             97
France            87
China             73
Name: manufacturer_country, dtype: int64


In [3]:
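# value_counts() sorts by frequency, so position 0 is the modal category
# and position 3 is the country that appears 4th most frequently
print(f'The modal category is {freq_table.index[0]}.')
print(f'The country that appears 4th most frequently is {freq_table.index[3]}.')

The modal category is Japan.
The country that appears 4th most frequently is United States.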
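
To make the link between a frequency table and the modal category concrete, here is a minimal sketch on a small hypothetical Series (`toy_countries` is made up for illustration and is not drawn from `car_eval_dataset.csv`). It reads the mode off the top of the frequency table and, as a cross-check, with `Series.mode()`.

```python
import pandas as pd

# Hypothetical toy data, not taken from car_eval_dataset.csv
toy_countries = pd.Series(
    ["Japan", "Germany", "Japan", "Italy", "Germany", "Japan", "France"]
)

# value_counts() returns counts sorted from most to least frequent
toy_freq = toy_countries.value_counts()

# The first index label is therefore the modal category
toy_mode = toy_freq.index[0]

# Series.mode() returns every category tied for the highest count
toy_modes = toy_countries.mode().tolist()

print(toy_freq)
print(toy_mode, toy_modes)
```

If two categories tie for the highest count, `Series.mode()` returns both, whereas reading only the first index label reports just one of them.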


2. Calculate a table of proportions for countries that appear in `manufacturer_country` in the dataset. What percentage of cars were manufactured in Japan?

In [4]:
# note: a table of frequencies gives the count of observations in each
# category, while a table of proportions gives each category's share
# of the total
prop_table = car_eval.manufacturer_country.value_counts(normalize=True)
print(prop_table)

Japan            0.228
Germany          0.218
South Korea      0.159
United States    0.138
Italy            0.097
France           0.087
China            0.073
Name: manufacturer_country, dtype: float64


In [5]:
# what percentage of cars were manufactured in Japan?
# select by label so the result does not depend on the sort order
jpn_per = prop_table['Japan']
print(f'Japan manufactured {jpn_per:.1%} of the cars.')

Japan manufactured 22.8% of the cars.


## Summarizing Buying Costs

3. `buying_cost` is a categorical variable which describes the cost of buying any car in the dataset. Print out a list of the possible values for this variable.

In [6]:
# print out a list of the possible values for buying_cost
print(car_eval.buying_cost.unique())

['vhigh' 'med' 'low' 'high']


4. `buying_cost` is an _ordinal categorical variable_, which means its values have a natural order, so we can rank them and perform order-based operations such as finding a median. Create a new list, `buying_cost_categories`, that contains the unique values in `buying_cost`, ordered from lowest to highest.

In [7]:
buying_cost_categories = ['low', 'med', 'high', 'vhigh']
print(buying_cost_categories)

['low', 'med', 'high', 'vhigh']


5. Convert `buying_cost` to type `'category'` using the list you created in the previous exercise.

In [8]:
# convert to an ordered 'category' type so that pandas knows the order
# of the categories and can map each one to an integer code
car_eval["buying_cost"] = pd.Categorical(
    car_eval["buying_cost"],
    categories=buying_cost_categories,
    ordered=True
)

# check that the column has type category
print(car_eval.buying_cost)

0      vhigh
1      vhigh
2        med
3        low
4        low
       ...  
995      low
996      low
997    vhigh
998      low
999      low
Name: buying_cost, Length: 1000, dtype: category
Categories (4, object): ['low' < 'med' < 'high' < 'vhigh']
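
As a rough illustration of what the ordered `'category'` type makes possible, the sketch below uses a few hypothetical values (`toy_cost` is invented for this example) with the same category order as above. It shows that comparisons, `min()`, and `max()` follow the declared order rather than alphabetical order.

```python
import pandas as pd

# Hypothetical toy values, not taken from car_eval_dataset.csv
toy_cost = pd.Series(
    pd.Categorical(
        ["med", "vhigh", "low", "high", "med"],
        categories=["low", "med", "high", "vhigh"],
        ordered=True,
    )
)

# Comparisons follow the declared category order, not alphabetical order
above_med = (toy_cost > "med").sum()

# min() and max() are defined because the categories are ordered
cheapest = toy_cost.min()
priciest = toy_cost.max()

# .cat.codes gives each value's position in the category list
codes = toy_cost.cat.codes.tolist()

print(above_med, cheapest, priciest, codes)
```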


6. Calculate the median category of the `buying_cost` variable and print the result.

In [9]:
# use np.median() on the integer codes from the .cat.codes attribute,
# then map the median code back to its category label
median_category_num = np.median(car_eval['buying_cost'].cat.codes)
median_category = buying_cost_categories[int(median_category_num)]
print(f'The median category code of buying_cost is {median_category_num}, which corresponds to the {median_category} category.')

The median category code of buying_cost is 1.0, which corresponds to the med category.
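
The median-by-codes approach has one caveat worth sketching: pandas assigns the code `-1` to missing values, which would pull the median down if left in. The example below is a minimal sketch on hypothetical data (`toy_cost` and `valid_codes` are names invented here) that filters those codes out first. Note also that with an even number of observations `np.median()` can return a value such as `1.5`, and `int()` then truncates it to the lower category.

```python
import numpy as np
import pandas as pd

categories = ["low", "med", "high", "vhigh"]

# Hypothetical toy values, including one missing entry
toy_cost = pd.Series(
    pd.Categorical(
        ["low", "med", None, "high", "vhigh", "med"],
        categories=categories,
        ordered=True,
    )
)

codes = toy_cost.cat.codes        # missing values receive the code -1
valid_codes = codes[codes >= 0]   # drop them before taking the median

median_code = np.median(valid_codes)
median_category = categories[int(median_code)]

print(median_code, median_category)
```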


## Summarizing Luggage Capacity

7. `luggage` is a categorical variable in the car evaluations dataset that records the luggage capacity for each reviewed car. Calculate a table of proportions for this variable and print the result.

In [10]:
prop_lug_table = car_eval.luggage.value_counts(normalize=True)
print(prop_lug_table)

small    0.339
med      0.333
big      0.328
Name: luggage, dtype: float64


8. Are there any missing values in this column? Replicate the table of proportions from the previous exercise, but do not drop any missing values from the count. Print the result.

In [11]:
print(car_eval.luggage.value_counts(dropna=False, normalize=True))

small    0.339
med      0.333
big      0.328
Name: luggage, dtype: float64


Since both tables of proportions are the same, there are no missing values in this column.
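
For contrast, here is a minimal sketch of what the two tables would look like if a column did contain a missing value. The data (`toy_luggage`) is hypothetical and only meant to show how `dropna=False` changes the result.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
toy_luggage = pd.Series(["small", "med", "big", "small", np.nan])

# Default: NaN is dropped, so proportions are relative to the 4 non-null rows
props_default = toy_luggage.value_counts(normalize=True)

# dropna=False: NaN gets its own row, so proportions are relative to all 5 rows
props_with_na = toy_luggage.value_counts(dropna=False, normalize=True)

# A direct count of missing values gives the same answer more explicitly
n_missing = toy_luggage.isna().sum()

print(props_default)
print(props_with_na)
print(n_missing)
```

Here the two tables differ (an extra `NaN` row appears and the other proportions shrink), which is the signal that missing values are present.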

9. Without passing `normalize = True` to `.value_counts()`, can you replicate the result you got in the previous exercises?

In [12]:
# replicate the result of previous exercise
print(car_eval.luggage.value_counts()/len(car_eval.luggage))

small    0.339
med      0.333
big      0.328
Name: luggage, dtype: float64


In [13]:
# dividing by len() counts every row, so it only matches normalize=True
# when there are no null values; .count() excludes nulls and therefore
# matches normalize=True even when nulls are present
print(car_eval.luggage.value_counts()/car_eval.luggage.count())

small    0.339
med      0.333
big      0.328
Name: luggage, dtype: float64
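
The difference between the two denominators only shows up when nulls exist, so the sketch below uses a hypothetical Series (`toy`) with one missing value to compare them side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
toy = pd.Series(["small", "med", "big", "small", np.nan])

# len() counts all 5 rows, including the null
by_len = toy.value_counts() / len(toy)

# .count() counts only the 4 non-null rows
by_count = toy.value_counts() / toy.count()

print(by_len.sum())    # less than 1 because the null row is in the denominator
print(by_count.sum())  # proportions of the non-null values sum to 1
```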


## Summarizing Passenger Capacity

10. `doors` is a categorical variable in the car evaluations dataset that records the count of doors for each reviewed car. Find the count of cars that have 5 or more doors. You can identify cars with 5+ doors by looking for cars that have a value of `'5more'` in the `doors` column. Print your result.

In [14]:
frequency = (car_eval.doors == '5more').sum()
print(f'The number of cars that have 5 or more doors is {frequency}.')

The number of cars that have 5 or more doors is 246.


11. Find the proportion of cars that have 5+ doors and print the result.

In [15]:
proportion = (car_eval.doors == '5more').mean()
print(f'The proportion of cars that have 5 or more doors is {proportion}.')

The proportion of cars that have 5 or more doors is 0.246.
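
The last two exercises rely on the fact that a comparison produces a boolean Series in which `True` counts as 1 and `False` as 0. A minimal sketch with hypothetical values (`toy_doors` is invented for illustration) shows why `.sum()` gives a count and `.mean()` gives a proportion.

```python
import pandas as pd

# Hypothetical toy data, not taken from car_eval_dataset.csv
toy_doors = pd.Series(["2", "5more", "4", "5more", "3"])

# Comparing to a value yields a boolean Series
is_5more = toy_doors == "5more"

# True is treated as 1, so the sum is the number of matching rows
count_5more = is_5more.sum()

# The mean of 0s and 1s is the share of rows that match
prop_5more = is_5more.mean()

print(count_5more, prop_5more)
```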
