# Getting Started

**TEXT CLASSIFICATION**



**Dataset :** Natural Language Processing with Disaster Tweets is a Kaggle challenge in which tweets are labelled according to whether or not they describe a disaster that actually occurred. Tweets are written in informal social media language, which makes them hard to classify automatically, and ambiguity in the text makes it harder still to identify tweets that report a real disaster. The objective of this project is to use machine learning to predict whether a tweet contains information about a real disaster.

**Source :** https://www.kaggle.com/c/nlp-getting-started



## Imports

In [None]:
import re
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
plt.rcParams['font.size'] = 10
plt.rcParams['axes.titlepad'] = 30 

from collections import Counter
from wordcloud import WordCloud
from nltk import sent_tokenize, word_tokenize, TweetTokenizer
from nltk.corpus import stopwords
import string
from PIL import Image
import PIL.ImageOps
from wordcloud import ImageColorGenerator
import contractions
STOPWORDS = set(stopwords.words('english'))
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, FeatureUnion

import warnings
warnings.filterwarnings('ignore')

## Data Load

In [None]:
df_train = pd.read_csv('data/train.csv')
df_test = pd.read_csv('data/test.csv')

# Exploratory Data Analysis

In [None]:
df_train.head()

In [None]:
df_train.shape

**EDA Steps:**
    
1. Data Distribution
2. Missing Values
3. Cardinality for features and target
4. Distribution of Target by keyword
5. Distribution of Target by location
6. Most frequently occurring words

## Cardinality Check

In [None]:
for col in df_train.columns:
    print("{} has {} unique instances".format(col, len(df_train[col].unique())))

There are 7613 unique ids, which matches the number of rows in the dataframe.

This is a binary classification task, and `target` has two classes.

In [None]:
# `null_counts` was deprecated and later removed from pandas; `show_counts` replaces it
df_train.info(verbose=True, show_counts=True)

## Missing Values

In [None]:
# Get the percentage of missing values
output = df_train.isnull().sum() * 100  / len(df_train) 
# type(output) : Pandas Series

# plot them
plt.figure(figsize=[5,3])
sns.barplot(y=list(output.index), x=list(output))
plt.title('Missing Values by Columns')
plt.show()

Both `keyword` and `location` have missing values; `location` is missing for about 33% of the rows. The gaps could be filled by detecting place names in the tweet text. I will not do this, because the model uses tf-idf on the tweet text joined with the location, so a place name that appears in the text is already available as a feature. This step is therefore not required for modelling, though it could be useful for EDA.

In [None]:
df_train[~df_train['location'].isna()]

It is evident that the location information is not consistent. It is also possible to extract location information from the text, as in id 50.

In [None]:
df_train[df_train['id'] == 48]['text']

## Column Names - STATIC

I prefer standardizing column names:

In [None]:
KEYWORD = 'keyword'
ID = 'id'
LOCATION = 'location'
TEXT = 'text'
TARGET = 'target'
TEXT_TOKENIZED = 'Text Tokenized'
SENTIMENT = 'Sentiment Score'
SENTIMENT_ROUND = 'Sentiment Score (rounded off)'
WORDS_PER_TWEET = 'Words Per Tweet'
CHAR_PER_TWEET = 'Characters Per Tweet'
LOCATIONS = 'Locations'
ALL_TEXT = 'all_text'
ALL_TEXT_JOINED = 'all_text_joined'
NUM_IN_TWEETS = 'Number in Tweet'
PUNCTUATION_COUNT = 'Punctuation Count Per Tweet'
IDENTIFIABLE_LOCATION = 'Identifiable Location'
IN_BOW = 'Present In BOW'

## Keyword Analysis

1. Which keywords have occurred the most?
2. Which keywords have a higher percentage of tweets about real disasters?

In [None]:
plt.figure(figsize=[10, 60])
sns.countplot(y=KEYWORD,
              data=df_train,
              palette=['grey'],
              order=df_train[KEYWORD].value_counts().index)
plt.xticks(rotation=90)
plt.title("Keyword Count")
plt.show()

In [None]:
plt.figure(figsize=[10, 60])
sns.countplot(y=KEYWORD, hue=TARGET, data=df_train, palette=['grey', 'red'])
plt.xticks(rotation=90)
plt.legend(loc='upper right')
plt.title("Distribution of Target per Keyword ")
plt.show()

Spaces in keywords are encoded as `%20` and will need to be cleaned.

From this chart, tweets with the keywords 'derailment', 'debris' and 'wreckage' are all about real disasters.

'body%20bags' shows the largest gap between the two classes, with tweets not about real disasters far outnumbering tweets about real disasters.

In [None]:
df_train[(df_train[KEYWORD] == 'body%20bags') & (df_train[TARGET] == 1)][TEXT].values

In [None]:
# Get keyword counts for tweets about real disasters
real_disaster_keywords = df_train[df_train['target'] == 1].groupby(['keyword', 'target']).count()['id'].reset_index()
real_disaster_keywords.head()

In [None]:
# Get keyword counts for tweets not about real disasters
unreal_disaster_keywords = df_train[df_train['target'] == 0].groupby(['keyword', 'target']).count()['id'].reset_index()
unreal_disaster_keywords.head()

In [None]:
plt.figure(figsize=[6, 5])
sns.barplot(x=KEYWORD, y=ID, data=real_disaster_keywords.sort_values('id', ascending=False)[:10], palette='gist_heat')
plt.xticks(rotation=60)
plt.ylabel('Count')
plt.title('Top 10 Keywords for Tweets about Real Disasters')
plt.show()

In [None]:
plt.figure(figsize=[6, 5])
sns.barplot(x=KEYWORD, y=ID, data=unreal_disaster_keywords.sort_values('id', ascending=False)[:10], palette='gist_heat')
plt.xticks(rotation=60)
plt.ylabel('Count')
plt.title('Top 10 Keywords for Tweets NOT about Real Disasters')
plt.show()

In [None]:
# Merge the counts and calculate, for each keyword, the probability that
# a tweet containing that keyword is about a real disaster.

merged_counts_keywords = pd.merge(
    left=real_disaster_keywords,
    right=unreal_disaster_keywords,
    left_on=KEYWORD,
    right_on=KEYWORD,
    how='outer').drop(columns=['target_x', 'target_y']).fillna(0)

# P(real disaster | keyword) = real count / (real count + non-real count)
merged_counts_keywords['prob_real_disasters'] = (
    merged_counts_keywords['id_x'] /
    (merged_counts_keywords['id_x'] + merged_counts_keywords['id_y']))

top_prob_real_disaster_keywords = merged_counts_keywords.sort_values(
    'prob_real_disasters', ascending=False)[:10]
top_prob_real_disaster_keywords
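
Because `target` is 0/1, the share of real-disaster tweets for a keyword is simply the mean of `target` within that keyword's group. The following is a minimal sketch of that idea on a made-up frame; the `toy` data and the column names `prob_real_disaster` and `n_tweets` are assumptions for illustration, not values from the dataset.

```python
import pandas as pd

# Hypothetical miniature of df_train: keyword plus 0/1 target.
toy = pd.DataFrame({
    'keyword': ['wreckage', 'wreckage', 'body bags', 'body bags', 'body bags', 'debris'],
    'target':  [1, 1, 0, 0, 1, 1],
})

# Mean of a 0/1 column is the fraction of ones, i.e. P(real disaster | keyword).
prob_real = (
    toy.groupby('keyword')['target']
       .agg(prob_real_disaster='mean', n_tweets='count')
       .sort_values('prob_real_disaster', ascending=False)
)
print(prob_real)
```

Keeping `n_tweets` next to the probability makes it easier to spot keywords whose high probability rests on only a handful of tweets.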

## Target Analysis

**Imbalanced data** is a problem in classification tasks where the classes are not represented equally. Is the dataset balanced?

In [None]:
df_train.target.value_counts() / len(df_train)

In [None]:
sns.countplot(y=TARGET,
              data=df_train[TARGET].replace({
                  0: 'Not about Real Disaster',
                  1: 'About Real Disaster'
              }).reset_index(),
              palette=['grey', 'red'])
plt.title('Target Distribution')
plt.ylabel(None)
plt.xlabel('Count')
plt.show()

This dataset is slightly imbalanced, which should not be a problem for us. The disparity is about 1000 data points, with tweets not about real disasters in the majority.
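
One way to put a number on "slightly imbalanced" is the accuracy of a classifier that always predicts the majority class; any trained model should beat it. A small sketch, using a made-up label list rather than the real `target` column:

```python
from collections import Counter

# Hypothetical labels for illustration only (0 = not real, 1 = real disaster).
labels = [0] * 57 + [1] * 43

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy obtained by always predicting the majority class.
baseline_accuracy = majority_count / len(labels)
print(majority_class, baseline_accuracy)
```

The closer this baseline is to 0.5, the less the imbalance distorts accuracy as a metric.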

In [None]:
sns.countplot(x=TARGET, data=df_train[~df_train[LOCATION].isna()], palette=['grey', 'red'])
plt.show()

For the tweets whose location is NOT missing, the same class imbalance holds. Therefore, dropping the rows with a missing location will not help balance the data.

In [None]:
sns.countplot(x=TARGET, data=df_train[df_train[LOCATION].isna()], palette=['grey', 'red'])
plt.show()

For the tweets whose location is missing, the same class imbalance holds as well.

## Top 20 locations

Some locations are more prone to real disasters and it is highly likely that tweets from those locations will be about real disasters.

Objectives:

1. Nature of location data
2. Most frequently occurring locations

In [None]:
df_train[df_train[TARGET] == 1].groupby(LOCATION)[TARGET].count().reset_index()

There are some 
* gibberish locations
* latitudes and longitudes
* English words in location

In [None]:
df_train[df_train[TARGET] == 1].groupby(
    LOCATION)[TARGET].count().reset_index().sort_values(by=TARGET)

The location formats include:
* city
* city, state
* city, country
* country abbreviation
* country name
* city, country / worldwide

In [None]:
plt.figure(figsize=[18, 10])
sns.barplot(x=LOCATION,
            y=TARGET,
            data=df_train[df_train[TARGET] == 1].groupby(LOCATION)
            [TARGET].count().reset_index().sort_values(by=TARGET,
                                                       ascending=False)[:20],
            palette='gist_heat'
           )
plt.xticks(rotation=90)
plt.title('Top 20 Most Frequently Appearing Locations for Tweets about Real Disaster')
plt.xlabel('Tweet Post Locations')
plt.ylabel('Counts')
plt.show()

Countries and cities overlap, and there are also coordinates and some gibberish entries.
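
If the coordinate entries ever need to be separated from place names, a regular expression is one possible approach. The sketch below is an assumption about what such entries look like (two signed decimal numbers separated by a comma); `COORD_PATTERN` and `looks_like_coordinates` are hypothetical names that are not used elsewhere in this notebook.

```python
import re

# Assumed coordinate shape: "<lat>, <lon>", each an optionally signed decimal.
COORD_PATTERN = re.compile(
    r'^\s*-?\d{1,3}(?:\.\d+)?\s*,\s*-?\d{1,3}(?:\.\d+)?\s*$'
)

def looks_like_coordinates(location):
    """Return True if the whole string looks like a latitude/longitude pair."""
    return bool(COORD_PATTERN.match(location))

samples = ['40.7128, -74.0060', 'Mumbai, India', 'USA', '??']
flags = {s: looks_like_coordinates(s) for s in samples}
print(flags)
```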

## Text Data Check

Here I spot-check a few random tweets to see what the text looks like.

In [None]:
df_train[TEXT][8], df_train[TARGET][8]

In [None]:
df_train[TEXT][20], df_train[TARGET][20]

In [None]:
df_train[TEXT][1000] , df_train[TARGET][1000]

In [None]:
df_train[TEXT][2000] , df_train[TARGET][2000]

## Test Data Check

In [None]:
df_test.head()

In [None]:
df_test.isnull().sum()

**Notes:**

To remove:
1. urls from the texts, 
2. html tags
3. mentions using @.
4. %20 from keywords

Hashtags will be retained, since they carry important information, but the # symbol in them will be removed.

# Basic Cleaning

A regular expression (regex) is a pattern of characters used to find substrings in a given string, for example to pick out URLs, numbers, hashtags or mentions from a large unstructured corpus of tweets.

In Python, `re.sub` replaces every match of a pattern in an input string with a replacement. The syntax is as follows:

```re.sub(pattern, replacement, input)```

Other useful functions in the `re` module include `findall`, `fullmatch` and `split`.
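
As a quick illustration of how these functions differ, here is a small sketch on a made-up tweet. The sample strings and the simple patterns are assumptions for demonstration; they are not the patterns used for the actual cleaning later on.

```python
import re

tweet = 'Flood in #Kerala! Details: https://example.com @reliefnews #help'

# findall returns every match (here, the captured hashtag text).
hashtags = re.findall(r'#(\w+)', tweet)

# sub replaces every match with the replacement string.
no_mentions = re.sub(r'@\w+', '', tweet)

# split cuts the string wherever the pattern matches.
tokens = re.split(r'\s+', 'storm   warning issued')

# fullmatch succeeds only if the entire string matches the pattern.
is_number = re.fullmatch(r'\d+', '123') is not None
not_number = re.fullmatch(r'\d+', '123 dead') is not None

print(hashtags, tokens, is_number, not_number)
```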

## Cleaning Tests

In [None]:
test_string = 'I am at https://www.nabanita.org www.nabanita.org okay'
# The dot after "www" is escaped so that it matches a literal "."
url_pattern = r'(www\.|http[s]?://)(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
test_op = re.sub(url_pattern, '', test_string)
test_op

In [None]:
test_string = 'I am at <p>www.nabanita.org &nbsp;</p>'
html_entities = r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});'
test_op = re.sub(html_entities, '', test_string)
test_op

In [None]:
test_string = 'I am at @nabanita #python testing 123'
tag_pattern = r'@([a-z0-9]+)|#'
test_op = re.sub(tag_pattern, '', test_string)
test_op

A tweet may mention a news channel whose Twitter handle contains the word 'news'. Removing mentions would discard that signal, so the `detect_news` function below appends the word 'news' to the tweet whenever 'news' appears anywhere in its text.
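
The order of the two steps matters. The sketch below re-creates simplified versions of the two helpers defined in the next section and applies them in both orders to a made-up tweet; the handle `@foxnews` and the sample text are illustrative assumptions.

```python
import re

TAG_PATTERN = r'@([a-z0-9]+)|#'

def detect_news(text):
    # Append ' news' if the substring occurs anywhere, including inside a handle.
    return text + ' news' if 'news' in text else text

def remove_social_media_tags(text):
    # Drop @mentions entirely and strip the '#' symbol from hashtags.
    return re.sub(TAG_PATTERN, '', text)

tweet = '@foxnews reports a wildfire near the highway'

# Removing mentions first loses the 'news' signal ...
tags_first = detect_news(remove_social_media_tags(tweet))
# ... while detecting 'news' first preserves it.
news_first = remove_social_media_tags(detect_news(tweet))

print(repr(tags_first))
print(repr(news_first))
```

This is why `detect_news` is applied before `remove_social_media_tags` when the cleaning functions are run on the dataframe.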

## Text Preprocessing Functions

In [None]:
def remove_urls(text):
    '''Remove URLs and website links from the text, if any.'''
    # The dot after "www" is escaped so that it matches a literal "."
    url_pattern = r'(www\.|http[s]?://)(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    return re.sub(url_pattern, '', text)

def remove_html_entities(text):
    '''Remove HTML tags (e.g. <p>) and HTML entities (e.g. &nbsp;).'''
    html_entities = r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});'
    return re.sub(html_entities, '', text)

def convert_lower_case(text):
    return text.lower()

def detect_news(text):
    '''Append ' news' to the end of the tweet if 'news' appears anywhere in it.
    This avoids losing the keyword 'news' when it occurs only inside a mention,
    e.g. @SomeNewsChannel. Expects lower-cased text.'''
    if 'news' in text:
        text = text + ' news'
    return text

def remove_social_media_tags(text):
    '''Remove @mentions entirely and strip the # symbol from hashtags
    (the hashtag word itself is kept). Expects lower-cased text.'''
    tag_pattern = r'@([a-z0-9]+)|#'
    text = re.sub(tag_pattern, '', text)
    return text

# Count punctuation marks before they are removed altogether
def count_punctuations(text):
    # Raw string avoids invalid-escape warnings; each mark is counted once
    getpunctuation = re.findall(r'[.?"\'`,\-!:;()\[\]/“”]', text)
    return len(getpunctuation)

def preprocess_text(x):
    '''Expand contractions, strip non-alphanumeric characters,
    drop stopwords and lemmatize the remaining words.'''
    # Expand contractions first, while the apostrophes are still present
    expanded = contractions.fix(x)
    cleaned_text = re.sub(r'[^a-zA-Z\d\s]+', '', expanded).lower()
    word_list = [wnl.lemmatize(each_word)
                 for each_word in cleaned_text.split()
                 if each_word not in STOPWORDS]
    return " ".join(word_list)

In [None]:
df_train[TEXT] = df_train[TEXT].apply(remove_urls)
df_train[TEXT] = df_train[TEXT].apply(remove_html_entities)
df_train[TEXT] = df_train[TEXT].apply(convert_lower_case)
df_train[TEXT] = df_train[TEXT].apply(detect_news)
df_train[TEXT] = df_train[TEXT].apply(remove_social_media_tags)
df_train[PUNCTUATION_COUNT] = df_train[TEXT].apply(count_punctuations)
df_train[TEXT] = df_train[TEXT].apply(preprocess_text)

In [None]:
df_train.head()

In [None]:
# Test

# Expected to remove @FoxNews but have ' news' in the tweet text

df_train[(df_train[KEYWORD] == 'body%20bags') & (df_train[TARGET] == 1)][TEXT].values

## Punctuation Analysis

In [None]:
sns.boxplot(x=TARGET,
            y=PUNCTUATION_COUNT,
            data=df_train,
            palette=['grey', 'red'])
plt.title('Punctuation Analysis')
plt.xticks(labels=['Not Real Disasters', 'Real Disasters'], ticks=[0, 1])
plt.xlabel('Target')
plt.ylabel('Punctuation Count')
plt.show()

## Keyword Check

In [None]:
def clean_keyword(text):
    # Missing keywords are NaN (a float), so only clean actual strings
    if isinstance(text, str):
        text = text.replace('%20', ' ')
    return text

In [None]:
df_train[KEYWORD] = df_train[KEYWORD].apply(clean_keyword)

In [None]:
df_train[KEYWORD].unique()
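
The `%20` sequence is URL percent-encoding for a space, so a more general alternative to a literal string replacement is the standard library's `urllib.parse.unquote`, which decodes any percent-encoded character. A minimal sketch, where `decode_keyword` and the sample keywords are hypothetical:

```python
from urllib.parse import unquote

# Made-up keyword values, including a missing one.
raw_keywords = ['body%20bags', 'forest%20fire', 'wreckage', None]

def decode_keyword(keyword):
    # Leave missing (non-string) keywords untouched.
    return unquote(keyword) if isinstance(keyword, str) else keyword

decoded = [decode_keyword(k) for k in raw_keywords]
print(decoded)
```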

# Mention of Numbers in Tweets

Reports of real disasters often include numbers, such as casualty counts. Therefore, it is worth analyzing whether numbers are present in the tweets.

In [None]:
def get_numbers_in_tweet(text):
    list_numbers = re.findall(r'\d+', text)
    if list_numbers:
        return 1
    return 0

In [None]:
df_train[NUM_IN_TWEETS] = df_train[TEXT].apply(get_numbers_in_tweet)

In [None]:
df_train.info()

In [None]:
sns.countplot(x=TARGET, hue=NUM_IN_TWEETS, data=df_train)
plt.xticks(labels=['Not Real Disasters', 'Real Disasters'], ticks = [0, 1])
plt.xlabel('Target')
plt.ylabel('Count')
plt.title('Appearance of Numbers in Tweets')
plt.legend(labels=['Numbers Absent', 'Numbers Present'])
plt.show()

About 50% of the tweets about real disasters contain numbers.

# Sentiment Analysis

In [None]:
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob

In [None]:
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')

In [None]:
df_train[SENTIMENT] = df_train[TEXT].apply(lambda x: nlp(x)._.polarity)

In [None]:
df_train.head()

In [None]:
sns.displot(x=SENTIMENT, hue=TARGET, data=df_train, kde=True)
plt.show()

In [None]:
def sentiment_to_binary(x):
    if x > 0:
        return 1
    else:
        return 0

In [None]:
df_train[SENTIMENT_ROUND] = df_train[SENTIMENT].apply(sentiment_to_binary)
df_train.head()

In [None]:
df_train[SENTIMENT_ROUND].value_counts()

In [None]:
sns.countplot(x=TARGET, hue=SENTIMENT_ROUND, data=df_train)
plt.title('Sentiment Analysis')
plt.xticks(labels=['Not Real Disasters', 'Real Disasters'], ticks=[0, 1])
plt.xlabel('Target')
plt.ylabel('Counts')
# plt.ylabel('Punctuation Count')
plt.legend(labels=['Negative', 'Positive'])
plt.show()

The share of negative sentiment is higher in tweets about real disasters than in tweets not about real disasters.

# Tweet Length Analysis

Are the tweets about real disasters shorter or longer in nature?

In [None]:
df_train[TEXT_TOKENIZED] = df_train[TEXT].apply(word_tokenize)

In [None]:
df_train.head()

In [None]:
df_train[WORDS_PER_TWEET] = df_train[TEXT_TOKENIZED].apply(len)
df_train[CHAR_PER_TWEET] = df_train[TEXT].apply(len)

In [None]:
sns.histplot(x=WORDS_PER_TWEET, hue=TARGET, data=df_train, kde=True)
plt.title('Tweet Length Analysis - Words')
plt.legend([])
plt.show()
sns.histplot(x=CHAR_PER_TWEET, hue=TARGET, data=df_train, kde=True)
plt.title('Tweet Analysis - Characters')
plt.legend(labels=['Not about Real Disasters', 'About Real Disasters'], loc='best', bbox_to_anchor=(1.1, 0., 0.5, 0.5))
plt.show()

# Tweet Text Analysis using WordCloud

In [None]:
real_disaster_tweets = ' '.join(df_train[df_train[TARGET] == 1][TEXT])

In [None]:
non_real_disaster_tweets = ' '.join(df_train[df_train[TARGET] == 0][TEXT])

In [None]:
wc = WordCloud(background_color="black", 
               max_words=100, 
               width=1000, 
               height=600, 
               random_state=1).generate(real_disaster_tweets)

plt.figure(figsize=(15,15))
plt.imshow(wc)
plt.axis("off")
plt.title("Wordcloud of Tweets about Real Disasters")
plt.show()

In [None]:
wc = WordCloud(background_color="black", 
               max_words=100, 
               width=1000, 
               height=600,
               font_step=1,
               random_state=1).generate(non_real_disaster_tweets)

plt.figure(figsize=(15,15))
plt.imshow(wc)
plt.axis("off")
plt.title("Wordcloud of Tweets NOT about Real Disasters")
plt.show()

Emojis are present in the text, as evident on the wordcloud. Therefore, they need to be either detected or removed. I will not be addressing emoji detection in this project.

# Location Analysis

Do tweets about real disasters have more standard locations which are detectable by NER?

In [None]:
def check_location(x):
    '''Run spaCy NER on the location string and collect every GPE
    (geopolitical entity) found in it.
    Returns (list of GPE strings, flag), where flag is 1 if at least
    one GPE was found and 0 otherwise.'''
    spacy_loc = nlp(x)
    locs_in_tweet = [ent.text for ent in spacy_loc.ents if ent.label_ == 'GPE']
    return locs_in_tweet, int(bool(locs_in_tweet))

# Assign back instead of chained inplace=True, which newer pandas versions warn about
df_train[LOCATION] = df_train[LOCATION].fillna('')
df_train[LOCATIONS], df_train[IDENTIFIABLE_LOCATION] = zip(*df_train[LOCATION].apply(check_location))

In [None]:
sns.countplot(x=TARGET, hue=IDENTIFIABLE_LOCATION, data=df_train, palette=['grey', 'red'])
plt.legend(labels=['Location Unidentified', 'Location Identified'], loc='best', bbox_to_anchor=(1.1, 0., 0.5, 0.5))
plt.show()

Tweets about real disasters have a slightly higher percentage of locations identified by the named-entity recognizer.

# Final Text Data Preparation

In [None]:
df_train.head()

In [None]:
df_train[ALL_TEXT] = df_train[TEXT_TOKENIZED] + df_train[LOCATIONS]
df_train.head()

In [None]:
target = df_train[TARGET].values

In [None]:
df_train[ALL_TEXT_JOINED] = df_train[ALL_TEXT].apply(lambda x: " ".join(x))

The training data is ready now. Next step, prepping the test data.

# Test Data Preparation

In [None]:
df_test[TEXT] = df_test[TEXT].apply(remove_urls)
df_test[TEXT] = df_test[TEXT].apply(remove_html_entities)
df_test[TEXT] = df_test[TEXT].apply(convert_lower_case)
df_test[TEXT] = df_test[TEXT].apply(detect_news)
df_test[TEXT] = df_test[TEXT].apply(remove_social_media_tags)
df_test[PUNCTUATION_COUNT] = df_test[TEXT].apply(count_punctuations)
df_test[TEXT] = df_test[TEXT].apply(preprocess_text)

df_test[LOCATION] = df_test[LOCATION].fillna('')
df_test[LOCATIONS], df_test[IDENTIFIABLE_LOCATION] = zip(*df_test[LOCATION].apply(check_location))

df_test[TEXT_TOKENIZED] = df_test[TEXT].apply(word_tokenize)

df_test[ALL_TEXT] = df_test[TEXT_TOKENIZED] + df_test[LOCATIONS] 
# Keywords not added since the keywords are present in the text anyway

df_test[ALL_TEXT_JOINED] = df_test[ALL_TEXT].apply(lambda x: " ".join(x))

The test data is ready now.

In [None]:
df_test.head()

In [None]:
df_train.head()

In [None]:
# df_train.to_csv('data/df_train_prepped.csv', index=False)
# df_test.to_csv('data/df_test_prepped.csv', index=False)

# Building a Data Modeling Pipeline

**Text Vectorizing** is a common method that converts a sequence of text into a sequence of numbers. The numbers can represent a token or a whole sentence, depending on your use case.

Count Vectors and Tf-idf are the most common methods of vectorizing texts.
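
To make the difference concrete, the sketch below builds count vectors and tf-idf weights by hand for three made-up sentences. It uses the smoothed IDF formula that scikit-learn documents for `smooth_idf=True`, namely ln((1 + n) / (1 + df)) + 1, and leaves out the L2 row normalisation that `TfidfVectorizer` applies by default. The corpus and whitespace tokenization are assumptions for illustration.

```python
import math
from collections import Counter

# Made-up corpus; tokens are split on whitespace for simplicity.
docs = [
    'fire in the forest',
    'forest fire spreads',
    'my mixtape is fire',
]
tokenized = [d.split() for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

# Count vectors: one row per document, one column per vocabulary term.
count_vectors = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

# Smoothed inverse document frequency: ln((1 + n_docs) / (1 + doc_freq)) + 1
n_docs = len(docs)
doc_freq = {term: sum(term in doc for doc in tokenized) for term in vocab}
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

# Unnormalised tf-idf: raw count scaled by the term's IDF.
tfidf_vectors = [[count * idf[term] for count, term in zip(row, vocab)]
                 for row in count_vectors]

print(vocab)
print(count_vectors[0])
print([round(v, 3) for v in tfidf_vectors[0]])
```

A word such as 'fire', which occurs in every document, gets the lowest IDF, while rarer words such as 'mixtape' are weighted up. That is the main practical difference from plain counts.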

**Training a ML model** is the next step where the vectorized forms of texts are fed to a model, like logistic regression in this case. After model training, its performance is evaluated.

**Classification Evaluation Metrics**

1. Accuracy
2. Precision
3. Recall
4. F1-score
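
For reference, all four metrics can be derived from the counts of true/false positives and negatives. The sketch below computes them by hand on made-up labels; `classification_metrics` is a hypothetical helper, and the notebook itself uses the scikit-learn functions imported earlier.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    # Precision: of the tweets predicted as real disasters, how many were real?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the real-disaster tweets, how many were found?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1}

# Made-up labels for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```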

## Helper Functions

In [None]:
def print_classification_metrics(y_train, train_pred, y_test, test_pred):
    print('Training Accuracy: ', accuracy_score(y_train, train_pred))
    print('Training f1-score: ', f1_score(y_train, train_pred))
    print('Accuracy: ', accuracy_score(y_test, test_pred))
    print('Precision: ', precision_score(y_test, test_pred))
    print('Recall: ', recall_score(y_test, test_pred))
    print('f1-score: ', f1_score(y_test, test_pred))
    
def predict_challenge_test_data(model, test_data, filename):
    submission_predictions = model.predict(test_data)
    df_submission = pd.read_csv('data/sample_submission.csv')
    df_submission[TARGET] = submission_predictions
    df_submission.to_csv(filename, index=False)

## Count Vectorizer

In [None]:
cols_to_train = [ALL_TEXT_JOINED]

tt = TweetTokenizer()

X_train, X_test, y_train, y_test = train_test_split(df_train[cols_to_train],
                                                    df_train[TARGET].values,
                                                    test_size=0.2,
                                                    random_state=42)

ct = ColumnTransformer([('count_vec',
                         CountVectorizer(tokenizer=tt.tokenize,
                                         ngram_range=(1, 2)),
                                         ALL_TEXT_JOINED)],
                       remainder='passthrough')

ct.fit(X_train)
X_train_sparse = ct.transform(X_train)
X_test_sparse = ct.transform(X_test)
df_test_sparse = ct.transform(df_test[cols_to_train])

In [None]:
log_reg = LogisticRegression()
log_reg.fit(X_train_sparse, y_train)
test_prediction = log_reg.predict(X_test_sparse)
training_prediction = log_reg.predict(X_train_sparse)

In [None]:
print_classification_metrics(y_train, training_prediction, y_test, test_prediction)

## TF-IDF Vectorizer

In [None]:
cols_to_train = [ALL_TEXT_JOINED]

tt = TweetTokenizer()

X_train, X_test, y_train, y_test = train_test_split(df_train[cols_to_train],
                                                    df_train[TARGET].values,
                                                    test_size=0.2,
                                                    random_state=42)

ct = ColumnTransformer([('tfidf',
                         TfidfVectorizer(tokenizer=tt.tokenize,
                                         ngram_range=(1, 2),
                                         smooth_idf=True), ALL_TEXT_JOINED)],
                       remainder='passthrough')

ct.fit(X_train)
X_train_sparse = ct.transform(X_train)
X_test_sparse = ct.transform(X_test)
df_test_sparse = ct.transform(df_test[cols_to_train])

In [None]:
log_reg = LogisticRegression()
log_reg.fit(X_train_sparse, y_train)
test_prediction = log_reg.predict(X_test_sparse)
training_prediction = log_reg.predict(X_train_sparse)

In [None]:
print_classification_metrics(y_train, training_prediction, y_test, test_prediction)
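
The vectorizer and the classifier above could also be chained in a single scikit-learn `Pipeline`, so that fitting and prediction each become one call on raw text. This is only a sketch on made-up sentences, not the configuration used above: the toy texts, the default tokenizer and `max_iter=1000` are all assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up training examples (1 = real disaster, 0 = not).
texts = [
    'forest fire spreading near the village',
    'earthquake damages several buildings',
    'flood warning issued for the river area',
    'train derailment leaves many injured',
    'my new mixtape is fire',
    'this exam was a total disaster lol',
    'traffic was a wreck this morning',
    'that concert blew me away',
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer is fitted only on the data passed to fit(), then reused in predict().
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
    ('clf', LogisticRegression(max_iter=1000)),
])
pipe.fit(texts, labels)

preds = pipe.predict(['forest fire near the town', 'this song is fire'])
print(preds)
```

Bundling the steps this way keeps the vectorizer from being fitted on held-out data by accident and makes the whole model easy to cross-validate.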