# Summarizing Automobile Evaluation Data

In the following project, you'll use what you've learned about summarizing categorical data to analyze a sample from a popular open-source dataset. This dataset contains information on the cost and physical attributes of several thousand cars. Originally, this dataset was used to train a classification model that assigned an acceptability category to cars based on these attributes.

The car evaluation dataset has been sourced from the UCI Machine Learning Repository and has been slightly modified for this project. Specifically, one additional field, `manufacturer_country`, has been simulated for illustrative purposes. You can read more about the details, features, and original uses of this dataset on the [UCI data description page](https://archive.ics.uci.edu/ml/datasets/car+evaluation).

## Summarizing Manufacturing Country

1. `manufacturer_country` is a _nominal categorical variable_ that indicates the country of the manufacturer of each car reviewed. Create a table of frequencies of all the cars reviewed by `manufacturer_country`. What is the modal category? Which country appears 4th most frequently? Print out your results.

In [4]:
import pandas as pd

car_eval = pd.read_csv('car_eval_dataset.csv')
car_eval.head()


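| Unnamed: 0 | buying_cost | maintenance_cost | doors | capacity | luggage | safety | acceptability | manufacturer_country |
|---|---|---|---|---|---|---|---|---|
| 0 | vhigh | low | 4 | 4 | small | med | unacc | China |
| 1 | vhigh | med | 3 | 4 | small | high | acc | France |
| 2 | med | high | 3 | 2 | med | high | unacc | United States |
| 3 | low | med | 4 | more | big | low | unacc | United States |
| 4 | low | high | 2 | more | med | high | acc | South Korea |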


In [18]:
# Frequency table (value_counts sorts from most to least frequent)
print(car_eval.manufacturer_country.value_counts())
# Modal category: the first label in the sorted table
print('Modal Country:', car_eval.manufacturer_country.value_counts().index[0])
# 4th most frequent: zero-based position 3
print('4th Most Frequent:', car_eval.manufacturer_country.value_counts().index[3])


Japan            228
Germany          218
South Korea      159
United States    138
Italy             97
France            87
China             73
Name: manufacturer_country, dtype: int64
Modal Country: Japan
4th Most Frequent: United States
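
As a cross-check on reading the mode off a frequency table, here is a minimal sketch on a small made-up Series (the `countries` data below is hypothetical and is not taken from `car_eval`). It compares the first label of `value_counts()` with `Series.mode()` and shows that the k-th most frequent label sits at zero-based position k - 1.

```python
import pandas as pd

# Hypothetical toy data standing in for car_eval.manufacturer_country
countries = pd.Series(
    ["Japan", "Germany", "Japan", "Italy", "Germany", "Japan", "United States"]
)

counts = countries.value_counts()    # sorted from most to least frequent
modal_country = countries.mode()[0]  # mode() returns a Series; ties would give several rows
second_most = counts.index[1]        # zero-based position 1 is the 2nd most frequent

print(counts)
print("Modal country:", modal_country)    # Japan
print("2nd most frequent:", second_most)  # Germany
```

Note that when two labels have the same count, their relative order in `value_counts()` should not be relied on, so rank-based answers are only well defined when the counts differ.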


2. Calculate a table of proportions for countries that appear in `manufacturer_country` in the dataset. What percentage of cars were manufactured in Japan?

In [7]:
car_eval.manufacturer_country.value_counts(normalize = True)
# 22.8% of cars manufactured in Japan

Japan            0.228
Germany          0.218
South Korea      0.159
United States    0.138
Italy            0.097
France           0.087
China            0.073
Name: manufacturer_country, dtype: float64
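
To turn a table of proportions into readable percentages, one option is to scale by 100 and round. The sketch below uses a small hypothetical Series rather than the project data.

```python
import pandas as pd

# Hypothetical toy data; the real column is car_eval.manufacturer_country
countries = pd.Series(["Japan", "Japan", "Germany", "Italy"])

proportions = countries.value_counts(normalize=True)  # fractions that sum to 1
percentages = (proportions * 100).round(1)            # same table, as percentages
japan_pct = percentages["Japan"]                       # look up one category by label

print(percentages)
print(f"Japan: {japan_pct}%")  # Japan: 50.0%
```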

## Summarizing Buying Costs

3. `buying_cost` is a categorical variable which describes the cost of buying any car in the dataset. Print out a list of the possible values for this variable.

In [9]:
car_eval.buying_cost.unique()

array(['vhigh', 'med', 'low', 'high'], dtype=object)

4. `buying_cost` is an _ordinal categorical variable_, which means its values have a natural order, so rank-based summaries such as the median are meaningful. (Arithmetic summaries such as the mean are not, because the spacing between categories is undefined.) Create a new list, `buying_cost_categories`, that contains the unique values in `buying_cost`, ordered from lowest to highest.

In [10]:
buying_cost_categories = ['low', 'med', 'high', 'vhigh']

5. Convert `buying_cost` to type `'category'` using the list you created in the previous exercise.

In [20]:
car_eval.buying_cost = pd.Categorical(car_eval.buying_cost, buying_cost_categories, ordered = True)
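
To see what the ordered conversion buys you, here is a small sketch on hypothetical values (the `costs` Series is made up for illustration). Once the order is declared, pandas exposes integer codes and supports `min`, `max`, and comparisons against a category label.

```python
import pandas as pd

buying_cost_categories = ["low", "med", "high", "vhigh"]

# Hypothetical sample values, converted to an ordered categorical
costs = pd.Series(
    pd.Categorical(
        ["vhigh", "low", "med", "high", "low"],
        categories=buying_cost_categories,
        ordered=True,
    )
)

codes = costs.cat.codes.tolist()   # position of each value in the category list
print(codes)                       # [3, 0, 1, 2, 0]
print(costs.min(), costs.max())    # low vhigh

above_med = (costs > "med").sum()  # comparisons follow the declared order
print(above_med)                   # 2
```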

6. Calculate the median category of the `buying_cost` variable and print the result.

In [25]:
import numpy as np
median_buying_cost = np.median(car_eval.buying_cost.cat.codes)
print(median_buying_cost) 
 
median_category = buying_cost_categories[int(median_buying_cost)]
print(median_category)

1.0
med
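
One caveat with taking `np.median` of the codes: with an even number of rows the median can fall between two categories (for example 1.5), and `int()` silently rounds it down. The sketch below, on hypothetical data, reports both neighbouring categories in that case.

```python
import numpy as np
import pandas as pd

categories = ["low", "med", "high", "vhigh"]

# Hypothetical sample with an even number of observations
costs = pd.Series(
    pd.Categorical(["low", "med", "high", "vhigh"], categories=categories, ordered=True)
)

median_code = np.median(costs.cat.codes.to_numpy())  # 1.5: between 'med' and 'high'
lower = categories[int(np.floor(median_code))]       # category just below the median
upper = categories[int(np.ceil(median_code))]        # category just above the median

print(median_code, lower, upper)  # 1.5 med high
```

When `lower` and `upper` agree, the median category is unambiguous; when they differ, it is reasonable to report both.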


## Summarizing Luggage Capacity

7. `luggage` is a categorical variable in the car evaluations dataset that records the luggage capacity for each reviewed car. Calculate a table of proportions for this variable and print the result.

In [26]:
car_eval.luggage.value_counts(normalize = True)

small    0.339
med      0.333
big      0.328
Name: luggage, dtype: float64

8. Are there any missing values in this column? Replicate the table of proportions from the previous exercise, but do not drop any missing values from the count. Print the result.

In [27]:
car_eval.luggage.value_counts(normalize = True, dropna = False)

small    0.339
med      0.333
big      0.328
Name: luggage, dtype: float64
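
Because the two tables match, the project data appears to have no missing values in `luggage`. To see what `dropna = False` would show otherwise, here is a sketch on a hypothetical Series that does contain a missing value, together with a direct check using `isna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value (the project column has none)
luggage = pd.Series(["small", "med", np.nan, "big", "small"])

n_missing = luggage.isna().sum()  # direct count of missing values
with_nan = luggage.value_counts(normalize=True, dropna=False)
without_nan = luggage.value_counts(normalize=True)  # default drops NaN

print("Missing values:", n_missing)  # Missing values: 1
print(with_nan)     # includes a NaN row; 'small' is 0.4
print(without_nan)  # no NaN row; 'small' is 0.5
```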

9. Without passing `normalize = True` to `.value_counts()`, can you replicate the result you got in the previous exercises?

In [30]:
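# With no missing values, len() (the number of rows) is a valid denominator
print(car_eval.luggage.value_counts() / len(car_eval.luggage))

# count() excludes missing values, so this matches the default (dropna=True) proportions
print(car_eval.luggage.value_counts() / car_eval.luggage.count())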

small    0.339
med      0.333
big      0.328
Name: luggage, dtype: float64
small    0.339
med      0.333
big      0.328
Name: luggage, dtype: float64
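
The two denominators only differ when values are missing. A minimal sketch on hypothetical data with one `NaN` makes the difference visible: dividing by `len()` counts every row, while dividing by `count()` counts only non-missing rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
luggage = pd.Series(["small", "med", np.nan, "big", "small"])

counts = luggage.value_counts()       # NaN is dropped by default
by_len = counts / len(luggage)        # denominator 5 (all rows)
by_count = counts / luggage.count()   # denominator 4 (non-missing rows)

print(by_len["small"], by_len.sum())      # 0.4 and a total of 0.8
print(by_count["small"], by_count.sum())  # 0.5 and a total of 1.0
```

In this sketch, `by_len` does not sum to 1 because the missing row is in the denominator but has no entry in the table.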


## Summarizing Door Count

10. `doors` is a categorical variable in the car evaluations dataset that records the count of doors for each reviewed car. Find the count of cars that have 5 or more doors. You can identify cars with 5+ doors by looking for cars that have a value of `'5more'` in the `doors` column. Print your result.

In [35]:
five_more_count = (car_eval.doors == '5more').sum()
five_more_count

246

11. Find the proportion of cars that have 5+ doors and print the result.

In [38]:
five_more_proportion = (car_eval.doors == '5more').mean()
five_more_proportion

0.246
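
Both answers rely on the same idea: a comparison produces a boolean Series, and `True` counts as 1, so `.sum()` gives a count and `.mean()` gives a proportion. The sketch below checks this on a small hypothetical `doors` Series.

```python
import pandas as pd

# Hypothetical toy data standing in for car_eval.doors
doors = pd.Series(["2", "3", "4", "5more", "5more", "4", "2", "5more"])

is_five_more = doors == "5more"           # boolean Series: True where the car has 5+ doors
five_more_count = is_five_more.sum()      # True counts as 1, so this is a count
five_more_proportion = is_five_more.mean()  # the count divided by the number of rows

print(five_more_count)       # 3
print(five_more_proportion)  # 0.375
```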