# Logistic Regression

#### Classifying news articles into two categories, business or sports (binary classification)

Edit only the cells marked "Tune hyperparameters here".




Useful notebook shortcuts:

Ctrl+Enter -> Run current cell

Shift+Enter -> Run current cell and go to next cell

Alt+Enter -> Run current cell and add new cell below

In [1]:
%load_ext autoreload
%autoreload 2

import numpy as np
from model import *
from feature import *
from utils import *

### Load the training data

In [2]:
# Change the path to the training data directory
sports = readfiles('train/sports')
business = readfiles('train/business')

### Prepare bag-of-words based features (for the whole data)

#### To run this block: 

Complete the `preprocess` and `extract` functions in the `BagOfWordsFeatureExtractor` class in `feature.py`.
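
As a rough illustration of the idea, and not the expected solution, a bag-of-words extractor with the same `preprocess`/`extract` interface might look like the sketch below. The class name `SketchBagOfWords`, the `word_to_idx` attribute, and the assumption that each document is a list of tokens are all hypothetical; the real input format depends on what `readfiles` returns.

```python
from collections import Counter

import numpy as np


class SketchBagOfWords:
    """Hypothetical stand-in, not the assignment's BagOfWordsFeatureExtractor."""

    def __init__(self):
        self.word_to_idx = {}

    def preprocess(self, documents):
        # Build the vocabulary: one column index per distinct token,
        # assigned in order of first appearance.
        for doc in documents:
            for token in doc:
                if token not in self.word_to_idx:
                    self.word_to_idx[token] = len(self.word_to_idx)

    def extract(self, documents):
        # One row per document, one column per vocabulary word;
        # each entry is the raw count of that word in the document.
        features = np.zeros((len(documents), len(self.word_to_idx)))
        for row, doc in enumerate(documents):
            for token, count in Counter(doc).items():
                col = self.word_to_idx.get(token)
                if col is not None:  # skip tokens unseen during preprocess
                    features[row, col] = count
        return features


# Toy documents, assumed to be already tokenized (made up for illustration).
toy_docs = [["stocks", "rise", "stocks"], ["team", "wins"]]
sketch_bow = SketchBagOfWords()
sketch_bow.preprocess(toy_docs)
X_sketch = sketch_bow.extract(toy_docs)
```

Here `X_sketch` has one row per document and one column per vocabulary word, which is the shape the later cells rely on when they read `.shape[0]`.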

In [43]:
# Initialize the model and preprocess
bow = BagOfWordsFeatureExtractor()
bow.preprocess(business + sports)

In [44]:
# Extract features and create a numpy array of features
X_business_bow = bow.extract(business)
X_sports_bow = bow.extract(sports)
X_data_bow = np.concatenate((X_business_bow, X_sports_bow))

# We know the correct labels, so create a numpy array for correct labels
# Business -> 0, Sports -> 1
Y_data_bow = np.concatenate((np.zeros(X_business_bow.shape[0]), np.ones(X_sports_bow.shape[0])))

### Splitting into train and val sets

A common convention is to split the data 70/30 into training and validation sets.

In [45]:
num_train = X_data_bow.shape[0]   # total number of samples before the split
num_val = int(num_train * 0.3)    # np.int was removed in NumPy 1.24; use the built-in int

# Permute the indices to randomly split data into train and val
data_idxs = np.random.permutation(num_train)

# Separate train data
X_train_bow = X_data_bow[data_idxs[num_val:], :]
Y_train_bow = Y_data_bow[data_idxs[num_val:]]

# Separate val data
X_val_bow = X_data_bow[data_idxs[:num_val], :]
Y_val_bow = Y_data_bow[data_idxs[:num_val]]
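
The split above changes on every run because the permutation is unseeded. If repeatable results are wanted, one option is a seeded generator, sketched here on toy arrays. The seed value and the `*_toy` names are assumptions for illustration and are not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an assumption) so the split is repeatable

# Toy stand-ins for X_data_bow and Y_data_bow: 10 samples, 3 features.
X_toy = np.arange(30, dtype=float).reshape(10, 3)
Y_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)

num_total = X_toy.shape[0]
num_val_toy = int(num_total * 0.3)  # 30% of the samples go to validation
idxs = rng.permutation(num_total)   # shuffled sample indices

# The first num_val_toy shuffled indices form the validation set, the rest the training set.
X_val_toy, Y_val_toy = X_toy[idxs[:num_val_toy]], Y_toy[idxs[:num_val_toy]]
X_train_toy, Y_train_toy = X_toy[idxs[num_val_toy:]], Y_toy[idxs[num_val_toy:]]
```

Because features and labels are indexed with the same permutation, each row stays paired with its label.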

### Tune hyperparameters here
Training prints the loss at regular intervals. Tune the hyperparameters to reduce this loss and improve the validation set accuracy in the next block. There is no need to push the accuracy all the way to 1.0; the focus is on a correct implementation rather than on winning a contest.

#### To run this block: 
Complete the following functions in `model.py` of class `LogisticRegression`
1. `loss`
2. `train`
3. `sigmoid`
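
For orientation only, the three pieces can be sketched as standalone functions. This is a minimal sketch under stated assumptions, not the required implementation. The function names, the L2 penalty written as `0.5 * reg_const * ||w||^2`, the zero initialization, and the omission of a bias term are all assumptions; `model.py` may define them differently.

```python
import numpy as np


def sigmoid(z):
    # Logistic function; clipping keeps np.exp from overflowing for extreme inputs.
    z = np.clip(z, -500, 500)
    return 1.0 / (1.0 + np.exp(-z))


def loss_and_grad(w, X, y, reg_const):
    # Mean cross-entropy plus an L2 penalty (assumed form: 0.5 * reg_const * ||w||^2).
    n = X.shape[0]
    p = sigmoid(X @ w)
    eps = 1e-12  # avoids log(0)
    data_loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    loss = data_loss + 0.5 * reg_const * np.sum(w ** 2)
    grad = X.T @ (p - y) / n + reg_const * w
    return loss, grad


def gradient_descent(X, y, lr, reg_const, num_iter):
    # Batch gradient descent starting from zero weights (an assumed initialization).
    w = np.zeros(X.shape[1])
    for _ in range(num_iter):
        _, grad = loss_and_grad(w, X, y, reg_const)
        w -= lr * grad
    return w


# Toy, linearly separable data (made up for illustration).
X_demo = np.array([[2.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 2.0]])
y_demo = np.array([0.0, 0.0, 1.0, 1.0])

loss_start, _ = loss_and_grad(np.zeros(2), X_demo, y_demo, reg_const=0.0)
w_fit = gradient_descent(X_demo, y_demo, lr=0.1, reg_const=0.0, num_iter=200)
loss_end, _ = loss_and_grad(w_fit, X_demo, y_demo, reg_const=0.0)
```

With zero weights every prediction is 0.5, so the unregularized starting loss is ln 2, about 0.693; training on the toy data should bring it below that.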

In [46]:
lr = 0.0005               # Try changing the learning rate
reg_const = 0.5          # Try changing the regularization constant
add_bias = True        # Does adding bias help? Try changing between True and False
num_iter = 10000        # Do not change

model1 = LogisticRegression(add_bias) 
model1.train(X_train_bow, Y_train_bow, lr, reg_const)

Iteration 0 of 10000. Loss: 1.070766 LR: 0.000500
Iteration 100 of 10000. Loss: 0.950779 LR: 0.000500
Iteration 200 of 10000. Loss: 0.947475 LR: 0.000500
Iteration 300 of 10000. Loss: 0.944221 LR: 0.000500
Iteration 400 of 10000. Loss: 0.941018 LR: 0.000500
Iteration 500 of 10000. Loss: 0.937864 LR: 0.000500
Iteration 600 of 10000. Loss: 0.934759 LR: 0.000500
Iteration 700 of 10000. Loss: 0.931703 LR: 0.000500
Iteration 800 of 10000. Loss: 0.928694 LR: 0.000500
Iteration 900 of 10000. Loss: 0.925733 LR: 0.000500
Iteration 1000 of 10000. Loss: 0.922818 LR: 0.000500
Iteration 1100 of 10000. Loss: 0.919950 LR: 0.000500
Iteration 1200 of 10000. Loss: 0.917128 LR: 0.000500
Iteration 1300 of 10000. Loss: 0.914351 LR: 0.000500
Iteration 1400 of 10000. Loss: 0.911619 LR: 0.000500
Iteration 1500 of 10000. Loss: 0.908931 LR: 0.000500
Iteration 1600 of 10000. Loss: 0.906288 LR: 0.000500
Iteration 1700 of 10000. Loss: 0.903687 LR: 0.000500
Iteration 1800 of 10000. Loss: 0.901130 LR: 0.000500
Itera

### Validation Set Accuracy using Bag-of-words features

We use the function `score` of class `LogisticRegression` in the file `model.py`.
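
Accuracy is the fraction of samples whose predicted label matches the true label. A minimal sketch follows, assuming predicted probabilities are thresholded at 0.5; the helper name `accuracy` and the threshold are assumptions, and the actual `score` method may differ.

```python
import numpy as np


def accuracy(probabilities, labels, threshold=0.5):
    # Predict class 1 when the probability reaches the threshold, else class 0,
    # then report the fraction of predictions that match the true labels.
    predictions = (np.asarray(probabilities) >= threshold).astype(float)
    return float(np.mean(predictions == np.asarray(labels)))


# Three of the four toy predictions match their labels.
acc = accuracy([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 0])
```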

In [47]:
val_acc = model1.score(X_val_bow, Y_val_bow)
print("Final Validation Set Accuracy - ", val_acc)

Final Validation Set Accuracy -  0.9822222222222222


### Record the hyperparameters for the above model

These hyperparameters will be your submission.

In [48]:
save_hyper('bow', add_bias, lr, reg_const)

### Prepare Tf-Idf based features (for the whole data)

#### To run this block: 
Complete the `preprocess` and `extract` functions in the `TfIdfFeatureExtractor` class in `feature.py`.
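
Tf-Idf has several common variants, so the sketch below is only one possibility and is not the expected solution. It assumes term frequency is the count divided by document length, inverse document frequency is `log(N / df)`, and each document is a list of tokens. The class name `SketchTfIdf` and its attributes are hypothetical.

```python
import math
from collections import Counter

import numpy as np


class SketchTfIdf:
    """Hypothetical stand-in, not the assignment's TfIdfFeatureExtractor."""

    def __init__(self):
        self.word_to_idx = {}
        self.idf = None

    def preprocess(self, documents):
        # Document frequency: the number of documents containing each word.
        doc_freq = Counter()
        for doc in documents:
            doc_freq.update(set(doc))
        self.word_to_idx = {word: i for i, word in enumerate(sorted(doc_freq))}
        n_docs = len(documents)
        self.idf = np.zeros(len(self.word_to_idx))
        for word, i in self.word_to_idx.items():
            self.idf[i] = math.log(n_docs / doc_freq[word])  # assumed idf variant

    def extract(self, documents):
        features = np.zeros((len(documents), len(self.word_to_idx)))
        for row, doc in enumerate(documents):
            for token, count in Counter(doc).items():
                col = self.word_to_idx.get(token)
                if col is not None:  # skip tokens unseen during preprocess
                    # tf (count / document length) times idf
                    features[row, col] = (count / len(doc)) * self.idf[col]
        return features


# Toy documents, assumed to be already tokenized (made up for illustration).
toy_docs = [["market", "stocks", "market"], ["market", "goal"]]
sketch_tfidf = SketchTfIdf()
sketch_tfidf.preprocess(toy_docs)
X_tfidf_sketch = sketch_tfidf.extract(toy_docs)
```

In this variant a word that appears in every document, such as "market" above, gets an idf of zero and therefore contributes nothing to the features.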

In [18]:
# Initialize the model and preprocess
tfidf = TfIdfFeatureExtractor()
tfidf.preprocess(business + sports)

In [19]:
# Extract features and create a numpy array of features
X_business_tfidf = tfidf.extract(business)
X_sports_tfidf = tfidf.extract(sports)
X_data_tfidf = np.concatenate((X_business_tfidf, X_sports_tfidf))

# We know the correct labels, so create a numpy array for correct labels
# Business -> 0, Sports -> 1
Y_data_tfidf = np.concatenate((np.zeros(X_business_tfidf.shape[0]), np.ones(X_sports_tfidf.shape[0])))

### Splitting into train and val sets

A common convention is to split the data 70/30 into training and validation sets.

In [20]:
num_train = X_business_tfidf.shape[0] + X_sports_tfidf.shape[0]   # total number of samples before the split
num_val = int(num_train * 0.3)   # np.int was removed in NumPy 1.24; use the built-in int
X_data_tfidf = np.concatenate((X_business_tfidf, X_sports_tfidf))
Y_data_tfidf = np.concatenate((np.zeros(X_business_tfidf.shape[0]), np.ones(X_sports_tfidf.shape[0])))

# data_idxs is reused from the bag-of-words section so that both models are
# evaluated on the same split and their accuracies can be compared fairly

# Separate train data
X_train_tfidf = X_data_tfidf[data_idxs[num_val:], :]
Y_train_tfidf = Y_data_tfidf[data_idxs[num_val:]]

# Separate val data
X_val_tfidf = X_data_tfidf[data_idxs[:num_val], :]
Y_val_tfidf = Y_data_tfidf[data_idxs[:num_val]]

### Tune hyperparameters here
Training prints the loss at regular intervals. Tune the hyperparameters to reduce this loss and improve the validation set accuracy in the next block. There is no need to push the accuracy all the way to 1.0; the focus is on a correct implementation rather than on winning a contest.

You should have already implemented all the necessary functions to run this block.
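
Hyperparameters can be tuned by hand, but a small sweep makes the comparison systematic. The sketch below is only an illustration on toy data. The helper `train_and_score` is a hypothetical stand-in for the `train` and `score` methods of `LogisticRegression`, and the candidate values are arbitrary.

```python
import itertools

import numpy as np


def train_and_score(X_tr, y_tr, X_va, y_va, lr, reg_const, num_iter=500):
    # Hypothetical helper: batch gradient descent on an L2-regularized logistic
    # loss (no bias term), followed by accuracy on the validation data.
    w = np.zeros(X_tr.shape[1])
    for _ in range(num_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(X_tr @ w, -500, 500)))
        w -= lr * (X_tr.T @ (p - y_tr) / len(y_tr) + reg_const * w)
    preds = (X_va @ w >= 0).astype(float)  # sigmoid(z) >= 0.5 exactly when z >= 0
    return float(np.mean(preds == y_va))


# Toy, linearly separable data (made up for illustration).
X_tr = np.array([[2.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 2.0]])
y_tr = np.array([0.0, 0.0, 1.0, 1.0])
X_va = np.array([[3.0, 0.0], [0.0, 3.0]])
y_va = np.array([0.0, 1.0])

# Try every combination of the candidate learning rates and regularization constants.
results = {}
for lr_try, reg_try in itertools.product([0.001, 0.01, 0.1], [0.0, 0.5]):
    results[(lr_try, reg_try)] = train_and_score(X_tr, y_tr, X_va, y_va, lr_try, reg_try)

best_setting = max(results, key=results.get)  # (lr, reg_const) with the highest accuracy
```

On real data the settings will usually differ in accuracy; selecting by validation accuracy rather than training loss guards against picking a setting that merely overfits.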

In [40]:
lr = 0.0005               # Try changing the learning rate
reg_const = 0.5         # Try changing the regularization constant
add_bias = True        # Does adding bias help? Try changing between True and False
num_iter = 10000        # Do not change

model2 = LogisticRegression(add_bias)
model2.train(X_train_tfidf, Y_train_tfidf, lr, reg_const)

Iteration 0 of 10000. Loss: 1.052076 LR: 0.000500
Iteration 100 of 10000. Loss: 0.923512 LR: 0.000500
Iteration 200 of 10000. Loss: 0.908375 LR: 0.000500
Iteration 300 of 10000. Loss: 0.893891 LR: 0.000500
Iteration 400 of 10000. Loss: 0.880035 LR: 0.000500
Iteration 500 of 10000. Loss: 0.866779 LR: 0.000500
Iteration 600 of 10000. Loss: 0.854097 LR: 0.000500
Iteration 700 of 10000. Loss: 0.841964 LR: 0.000500
Iteration 800 of 10000. Loss: 0.830357 LR: 0.000500
Iteration 900 of 10000. Loss: 0.819251 LR: 0.000500
Iteration 1000 of 10000. Loss: 0.808625 LR: 0.000500
Iteration 1100 of 10000. Loss: 0.798457 LR: 0.000500
Iteration 1200 of 10000. Loss: 0.788727 LR: 0.000500
Iteration 1300 of 10000. Loss: 0.779415 LR: 0.000500
Iteration 1400 of 10000. Loss: 0.770503 LR: 0.000500
Iteration 1500 of 10000. Loss: 0.761972 LR: 0.000500
Iteration 1600 of 10000. Loss: 0.753806 LR: 0.000500
Iteration 1700 of 10000. Loss: 0.745988 LR: 0.000500
Iteration 1800 of 10000. Loss: 0.738503 LR: 0.000500
Itera

### Validation Set Accuracy using Tf-Idf features

We use the function `score` of class `LogisticRegression` in the file `model.py`.

In [41]:
val_acc = model2.score(X_val_tfidf, Y_val_tfidf)
print("Final Validation Set Accuracy - ", val_acc)

Final Validation Set Accuracy -  0.9922222222222222


### Record the hyperparameters for the above model

These hyperparameters will be your submission.

In [42]:
save_hyper('tf-idf', add_bias, lr, reg_const)