# Logistic Regression 
The dataset we will be working with contains information on various cars. For each car we have technical details such as the motor's displacement, the weight of the car, the miles per gallon, and how fast the car accelerates. Using this information we will predict the origin of the vehicle: North America, Europe, or Asia.

Here are the columns in the dataset:

- mpg -- Miles per gallon, Continuous.
- cylinders -- Number of cylinders in the motor, Integer, Ordinal, and Categorical.
- displacement -- Size of the motor, Continuous.
- horsepower -- Horsepower produced, Continuous.
- weight -- Weight of the car, Continuous.
- acceleration -- Acceleration, Continuous.
- year -- Year the car was built, Integer and Categorical.
- origin -- Integer and Categorical. 1: North America, 2: Europe, 3: Asia.
- car_name -- Name of the car.



In [1]:
import pandas as pd
import numpy as np
cars = pd.read_csv("C:/Users/Jennifer/Documents/Python/Data/auto.csv")
cars.head()
unique_regions = cars["origin"].unique()
print(unique_regions)

[1 3 2]


# Dummy Variables


In [2]:
dummy_cylinders = pd.get_dummies(cars["cylinders"], prefix="cyl")
cars = pd.concat([cars, dummy_cylinders], axis=1)
dummy_years = pd.get_dummies(cars["year"], prefix="year")
cars = pd.concat([cars, dummy_years], axis=1)
cars = cars.drop("year", axis=1)
cars = cars.drop("cylinders", axis=1)
print(cars.head())

    mpg  displacement horspower  weight  acceleration  origin  \
0  18.0         307.0       130    3504          12.0       1   
1  15.0         350.0       165    3693          11.5       1   
2  18.0         318.0       150    3436          11.0       1   
3  16.0         304.0       150    3433          12.0       1   
4  17.0         302.0       140    3449          10.5       1   

                    car_name  cyl_3  cyl_4  cyl_5   ...     year_73  year_74  \
0  chevrolet chevelle malibu      0      0      0   ...           0        0   
1          buick skylark 320      0      0      0   ...           0        0   
2         plymouth satellite      0      0      0   ...           0        0   
3              amc rebel sst      0      0      0   ...           0        0   
4                ford torino      0      0      0   ...           0        0   

   year_75  year_76  year_77  year_78  year_79  year_80  year_81  year_82  
0        0        0        0        0        0      

# Multiclass Classification


In [4]:
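# Shuffle the row order so the split is random.
shuffled_rows = np.random.permutation(cars.index)
shuffled_cars = cars.iloc[shuffled_rows]
# Use the first 70% of the shuffled rows for training and the rest for testing.
highest_train_row = int(cars.shape[0] * .70)
train = shuffled_cars.iloc[0:highest_train_row]
test = shuffled_cars.iloc[highest_train_row:]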

# Training MultiClass Regression Model
In the one-vs-all approach, we convert an n-class classification problem (in our case n is 3) into n binary classification problems. For our case, we'll need to train 3 models:

- A model where all cars built in North America are considered Positive (1) and those built in Europe and Asia are considered Negative (0).
- A model where all cars built in Europe are considered Positive (1) and those built in North America and Asia are considered Negative (0).
- A model where all cars built in Asia are considered Positive (1) and those built in North America and Europe are considered Negative (0).

Each of these models is a binary classification model that returns a probability between 0 and 1. When we apply the models to new data, each one returns a probability value (3 total). For each observation, we choose the label corresponding to the model that predicted the highest probability.
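
To make the relabeling concrete, here is a minimal sketch in plain Python. The five origin codes and the three probabilities are made up for illustration; they stand in for real data and for the output of the three fitted models.

```python
# Hypothetical toy data: origin codes for five cars
# (1: North America, 2: Europe, 3: Asia).
origins = [1, 3, 2, 1, 3]

# One binary target per class: 1 where the car belongs to that origin, else 0.
binary_targets = {
    origin: [int(o == origin) for o in origins]
    for origin in sorted(set(origins))
}
print(binary_targets)

# Made-up probabilities for a single new observation, one per model.
probs = {1: 0.20, 2: 0.65, 3: 0.40}

# Choose the label whose model returned the highest probability.
predicted = max(probs, key=probs.get)
print(predicted)
```

Because the three probabilities come from separate models, they need not sum to 1; only their ranking matters when picking the label.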

In [5]:
from sklearn.linear_model import LogisticRegression

unique_origins = cars["origin"].unique()
unique_origins.sort()

models = {}
features = [c for c in train.columns if c.startswith("cyl") or c.startswith("year")]

for origin in unique_origins:
    model = LogisticRegression()
    
    X_train = train[features]
    y_train = train["origin"] == origin

    model.fit(X_train, y_train)
    models[origin] = model

# Testing the Models


In [6]:
testing_probs = pd.DataFrame(columns=unique_origins)

for origin in unique_origins:
    # Select testing features.
    X_test = test[features]   
    # Compute probability of observation being in the origin.
    testing_probs[origin] = models[origin].predict_proba(X_test)[:,1]

# Choosing the Origin
Now that we have trained the models and computed the probability for each origin, we can classify each observation by selecting the origin with the highest predicted probability.

Since each column in our dataframe testing_probs represents an origin, we just need to choose the column with the largest probability in each row. We can use the DataFrame method .idxmax() to return a Series where each value is the label of the column in which the maximum value occurs for that observation. We need to set the axis parameter to 1 since we want to find the maximum across columns. Because each column maps directly to an origin, the resulting Series is the classification from our model.
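
As a small illustration of how .idxmax() behaves with axis=1, the sketch below uses a hypothetical three-row dataframe called toy_probs. Its values are invented and are not taken from the models in this notebook.

```python
import pandas as pd

# Made-up probabilities for three observations; the columns are origin codes.
toy_probs = pd.DataFrame({
    1: [0.70, 0.10, 0.30],
    2: [0.20, 0.60, 0.30],
    3: [0.10, 0.30, 0.40],
})

# For each row, return the label of the column holding the largest value.
toy_predictions = toy_probs.idxmax(axis=1)
print(toy_predictions.tolist())
```

Each entry in the result is a column label, which here is the origin code itself, so no extra mapping step is needed.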

In [7]:
predicted_origins = testing_probs.idxmax(axis=1)
print(predicted_origins)

0      1
1      1
2      1
3      3
4      2
5      2
6      1
7      1
8      1
9      1
10     2
11     2
12     3
13     1
14     1
15     2
16     1
17     1
18     1
19     1
20     2
21     1
22     1
23     1
24     1
25     1
26     3
27     3
28     1
29     1
      ..
90     2
91     1
92     1
93     1
94     1
95     1
96     2
97     3
98     1
99     1
100    1
101    1
102    1
103    2
104    2
105    1
106    1
107    1
108    1
109    2
110    2
111    3
112    1
113    2
114    3
115    3
116    1
117    2
118    1
119    1
Length: 120, dtype: int64
