In [26]:
import re
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

In [18]:
cars = pd.read_csv('data/auto.csv')

In [23]:
unique_origins = cars['origin'].unique()

In [20]:
cars = pd.concat([cars,
                  pd.get_dummies(cars['cylinders'], prefix='cyl'),
                  pd.get_dummies(cars['year'], prefix='year')],
                 axis=1).drop(['cylinders', 'year'], axis=1)

In [21]:
cars.head()

Unnamed: 0,mpg,displacement,horsepower,weight,acceleration,origin,cyl_3,cyl_4,cyl_5,cyl_6,...,year_73,year_74,year_75,year_76,year_77,year_78,year_79,year_80,year_81,year_82
0,18.0,307.0,130.0,3504.0,12.0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,15.0,350.0,165.0,3693.0,11.5,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,18.0,318.0,150.0,3436.0,11.0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,16.0,304.0,150.0,3433.0,12.0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,17.0,302.0,140.0,3449.0,10.5,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In previous missions, we explored binary classification, where there are only 2 possible categories, or classes. When we have 3 or more categories, we call the problem a multiclass classification problem. There are several approaches to multiclass classification; in this mission, we'll focus on the one-versus-all method.
The one-versus-all method is a technique where we choose a single category as the Positive case and group the rest of the categories as the Negative case. We're essentially splitting the problem into multiple binary classification problems, one per category. For each observation, each binary model then outputs the probability that the observation belongs to its category.
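As a minimal sketch of that splitting step, the snippet below builds one binary target per category from a small, made-up label column. The toy values and the names `labels` and `binary_targets` are illustrative assumptions, not part of the cars data.

```python
import pandas as pd

# Hypothetical labels for six observations spread over three categories.
labels = pd.Series([1, 2, 3, 1, 3, 1], name='label')

# One binary target per category: True where the row belongs to that
# category (the Positive case), False for every other category.
binary_targets = pd.DataFrame({category: labels == category
                               for category in sorted(labels.unique())})

print(binary_targets)
```

Each row is True in exactly one column, so every observation is a Positive example for one binary problem and a Negative example for the others.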
To start, let's split our data into a training set and a test set. The cell below shuffles the cars DataFrame, assigns the result to shuffled_cars, and then uses the first 70% of the rows for training and the remaining 30% for testing.

In [22]:
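# Shuffle the row labels, then reorder the DataFrame by those labels.
shuffled_rows = np.random.permutation(cars.index)
shuffled_cars = cars.loc[shuffled_rows]
# The first 70% of the shuffled rows form the training set, the rest the test set.
first_70pct = int(shuffled_cars.shape[0] * 0.70)
train = shuffled_cars.iloc[:first_70pct]
test = shuffled_cars.iloc[first_70pct:]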

In the one-versus-all approach, we convert an n-class classification problem (in our case, n is 3) into n binary classification problems. Here, we'll need to train 3 models:

A model where all cars built in North America are considered Positive (1) and those built in Europe and Asia are considered Negative (0).
A model where all cars built in Europe are considered Positive (1) and those built in North America and Asia are considered Negative (0).
A model where all cars built in Asia are considered Positive (1) and those built in North America and Europe are considered Negative (0).
Each of these models is a binary classification model that returns a probability between 0 and 1. When we apply the models to new data, each one returns a probability value (3 in total). For each observation, we choose the label corresponding to the model that predicted the highest probability.
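The selection rule can be illustrated with a small, made-up table of probabilities. The numbers and the names `toy_probs` and `toy_predictions` are assumptions for illustration only; each column stands for one of the three binary models.

```python
import pandas as pd

# Hypothetical Positive-class probabilities from three binary models
# (columns 1, 2 and 3) for four observations.
toy_probs = pd.DataFrame({1: [0.80, 0.20, 0.35, 0.10],
                          2: [0.15, 0.55, 0.30, 0.25],
                          3: [0.05, 0.25, 0.40, 0.70]})

# For each row, pick the column label whose model gave the highest probability.
toy_predictions = toy_probs.idxmax(axis=1)

print(toy_predictions.tolist())  # [1, 2, 3, 3]
```

Because the three models are trained independently, the probabilities in a row need not sum to 1 (the third row above sums to 1.05); only their relative order matters for the prediction.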

We'll use the dummy variables we created from the cylinders and year columns to train 3 models using the LogisticRegression class from scikit-learn.

In [45]:
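# Select the dummy columns created from cylinders and year.
feature_columns = [col for col in train.columns if re.match(r'(cyl|year)_', col)]
# Train one binary classifier per origin: that origin is Positive, all others Negative.
models = {}
for origin in unique_origins:
    model = LogisticRegression()
    model.fit(train[feature_columns], train['origin'] == origin)
    models[origin] = model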

In [58]:
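# Each column holds the test rows' predicted probability of belonging to that origin.
testing_probs = pd.DataFrame(columns=unique_origins)
for origin in unique_origins:
    model = models[origin]
    # predict_proba returns one column per class; column 1 is the Positive (True) class.
    testing_probs[origin] = model.predict_proba(test[feature_columns])[:, 1]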

In [62]:
predicted_origins = testing_probs.idxmax(axis=1)
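One way to check these predictions is to compare them with the true labels. The sketch below uses made-up stand-ins (`true_origins`, `toy_predicted`) rather than the real test set. It assumes, as in the cells above, that the predictions are indexed 0 to n-1 while the true labels keep their shuffled row labels, so the comparison is done by position.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: the true labels keep a shuffled index (like test['origin']),
# while the predictions are indexed 0..n-1 (like predicted_origins).
true_origins = pd.Series([1, 3, 2, 1], index=[7, 2, 5, 0])
toy_predicted = pd.Series([1, 3, 1, 1])

# Compare by position, not by index label, because the two indexes differ.
accuracy = np.mean(toy_predicted.to_numpy() == true_origins.to_numpy())

print(accuracy)  # 0.75
```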