# Day 97 - classification & MultinomialNB

1. Two files are attached to this exercise: <br>
<br>
data_train(97).csv <br>
target_train(97).csv <br>
<br>
The data_train.csv file contains e-mails from two categories: computer graphics (comp.graphics) and space (sci.space). The target_train.csv file contains the labels (0 - comp.graphics, 1 - sci.space). These files were loaded into the following DataFrames: <br>
<br>
data_train <br>
target_train <br>
<br>
Some processing of data_train and target_train was performed. The CountVectorizer class from the scikit-learn package was used to vectorize the text (data_train_vectorized variable). <br>
Using the MultinomialNB class, create a text document classification model. Train the model on the data_train_vectorized and target_train data. <br>
Then classify the following sentences:<br>
'The graphic designer requires a good processor to work'<br>
'Flights into space' <br>
In response, print the result to the console as shown below. <br>
Expected result: <br>
<br>
'The graphic designer requires a good processor to work' => comp.graphics <br>
'Flights into space' => sci.space
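
To make the vectorization step concrete, here is a minimal sketch on a made-up three-sentence corpus. The sentences and the `toy_` names are assumptions for illustration and are not part of the exercise data. `CountVectorizer` builds a vocabulary from the text it is fitted on and turns each document into a row of word counts; words that were not seen during fitting are ignored by `transform`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not the exercise data.
toy_docs = [
    'graphics card renders graphics',
    'rocket reaches orbit',
    'orbit of the rocket',
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# The vocabulary is stored as {term: column index}; columns are in
# alphabetical order, so sorting the keys gives the column layout.
toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)
print(toy_counts.toarray())  # one row per document, one column per term

# 'telescope' and 'in' were never seen during fit, so only 'orbit' is counted.
unseen = toy_vectorizer.transform(['telescope in orbit']).toarray()
print(unseen)
```

This is also why the solution calls `transform`, not `fit_transform`, on the new sentences: the columns must match the ones the classifier was trained on.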

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
 
 
data_train = pd.read_csv('data_train(97).csv')
target_train = pd.read_csv('target_train(97).csv')
 
categories = ['comp.graphics', 'sci.space']
 
data_train = data_train['text'].tolist()  # list of raw e-mail strings
target_train = target_train.values.ravel()  # flatten the labels to a 1-D array

vectorizer = CountVectorizer()
data_train_vectorized = vectorizer.fit_transform(data_train)  # sparse document-term count matrix

classifier = MultinomialNB()
classifier.fit(data_train_vectorized, target_train)

docs = [
    'The graphic designer requires a good processor to work',
    'Flights into space',
]
data_new = vectorizer.transform(docs)  # reuse the fitted vocabulary, do not refit

data_pred = classifier.predict(data_new)  # array of 0/1 labels

for doc, category in zip(docs, data_pred):
    print(f"'{doc}' => {categories[category]}")  # map the label index to its category name

'The graphic designer requires a good processor to work' => comp.graphics
'Flights into space' => sci.space
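
`predict` returns only the winning label. As a hedged illustration of how to inspect the model's confidence, the sketch below trains the same kind of pipeline on a hypothetical four-sentence corpus and reads the per-class probabilities with `predict_proba`. The corpus, the labels, and all `toy_` names are assumptions made up for this example; the real exercise data would give different numbers.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: 0 = comp.graphics, 1 = sci.space.
toy_docs = [
    'rendering images with a graphics card',
    'image rendering software for graphics',
    'rocket launch into orbit',
    'the shuttle orbit and space flights',
]
toy_labels = [0, 0, 1, 1]
toy_categories = ['comp.graphics', 'sci.space']

toy_vectorizer = CountVectorizer()
toy_clf = MultinomialNB()
toy_clf.fit(toy_vectorizer.fit_transform(toy_docs), toy_labels)

new_docs = ['graphics rendering', 'space orbit']
new_vectors = toy_vectorizer.transform(new_docs)

# Columns of predict_proba follow the order of toy_clf.classes_.
toy_proba = toy_clf.predict_proba(new_vectors)
toy_pred = [toy_categories[label] for label in toy_clf.predict(new_vectors)]

for doc, name, row in zip(new_docs, toy_pred, toy_proba):
    print(f"'{doc}' => {name} (p0={row[0]:.2f}, p1={row[1]:.2f})")
```

Each row of `toy_proba` sums to 1, so a value close to 0.5 signals a document the model finds hard to assign.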


2. Two files are attached to this exercise: <br>
<br>
data_train.csv <br>
target_train.csv <br>
<br>
The data_train.csv file contains e-mails from two categories: computer graphics (comp.graphics) and space (sci.space). The target_train.csv file contains the labels (0 - comp.graphics, 1 - sci.space). These files were loaded into the following DataFrames: <br>
<br>
data_train <br>
target_train <br>
<br>
Some processing of data_train and target_train was performed. Using the TfidfVectorizer class from the scikit-learn package, vectorize the text from the data_train list and assign the result to the data_train_vectorized variable. In response, print the shape of the resulting sparse matrix to the console.<br>
<br>
Expected result: <br>
(50, 3225)

In [3]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
 
 
data_train = pd.read_csv('data_train(97).csv')
target_train = pd.read_csv('target_train(97).csv')
 
categories = ['comp.graphics', 'sci.space']
 
data_train = data_train['text'].tolist()
target_train = target_train.values.ravel()
 
vectorizer = TfidfVectorizer()
data_train_vectorized = vectorizer.fit_transform(data_train)  # sparse TF-IDF document-term matrix
print(data_train_vectorized.shape)  # (number of documents, vocabulary size)

(50, 3225)
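
The shape reads as (number of documents, number of distinct tokens), so `(50, 3225)` means 50 e-mails and a vocabulary of 3225 terms. The following minimal sketch checks this relationship on a hypothetical three-sentence corpus (the sentences and `toy_` names are assumptions, not the exercise data). It also shows that, with the default settings, `TfidfVectorizer` scales each row to unit Euclidean length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not the exercise data.
toy_docs = [
    'graphics card renders graphics',
    'rocket reaches orbit',
    'orbit of the rocket',
]

toy_tfidf = TfidfVectorizer()
toy_matrix = toy_tfidf.fit_transform(toy_docs)

# Rows are documents, columns are vocabulary terms.
print(toy_matrix.shape)
print(len(toy_tfidf.vocabulary_))

# With the default norm='l2', every non-empty row has length 1.
toy_row_norms = np.linalg.norm(toy_matrix.toarray(), axis=1)
print(toy_row_norms)
```

Unlike the raw counts from `CountVectorizer`, these weights are floats that down-weight terms appearing in many documents.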
