# Group Project

As a Data Scientist, you are tasked to help these users find the most interesting articles
according to their preferred topics. You have a ***training dataset containing about 9500 news
articles, each assigned to one of the above topics***. In addition, (as in real life situation) the
dataset contains about ***48% of irrelevant articles*** (marked as IRRELEVANT) that do not
belong to any of the topics; hence the users are not interested in them. The distribution of
articles over topics is not uniform. There are some topics with large number of articles, and
some with very small number.

One day, 500 new articles have been published. This is your test set that has similar article
distribution over topics to the training set. ***Your task is to suggest up to 10 of the most relevant
articles from this set of 500 to each user***. The number of suggestions is limited to 10, because,
presumably, the users do not want to read more suggestions. It is possible, however, that some
topics within this test set have less than 10 articles. You also do not want to suggest 10 articles
if they are unlikely to be relevant, because you are concerned that the users may get
discouraged and stop using your application altogether. Therefore you need to take a balanced
approach, paying attention to not suggesting too many articles for rare topics.

### Import Library

In [1]:
# Load libraries
import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.model_selection import KFold, cross_val_score, train_test_split
import collections


### Pre-processing

In [2]:
# Import text data
raw_training = pd.read_csv("training.csv")
raw_testing = pd.read_csv("test.csv")

# Create bag of words
count = CountVectorizer()
bag_of_words = count.fit_transform(raw_training["article_words"])

# Create feature matrix
X = bag_of_words

# Create bag of words
y = raw_training["topic"]

#######################Resampling Dataset#######################


# Reducing the effect imbalnced by deleting some irrelevant class
# "Irrelevant" classe has 4734 samples in the training data, try to reduce it into 2000
irrelevant = raw_training[raw_training["topic"] == "IRRELEVANT"]
remove_n = 2734
drop_indices = np.random.choice(irrelevant.index, remove_n, replace=False)
irrelevant = irrelevant.drop(drop_indices)

reduce_training =  pd.concat([raw_training[raw_training["topic"] != "IRRELEVANT"], irrelevant],ignore_index=True)
reduce_bag_of_words = count.fit_transform(reduce_training["article_words"])
R_X = reduce_bag_of_words
R_y = reduce_training["topic"]

# Icreasing the minor classes
# Increasing 
topic_class = raw_training[raw_training["topic"] != "IRRELEVANT"]
increase_training = pd.concat([topic_class, topic_class, topic_class, raw_training[raw_training["topic"] == "IRRELEVANT"]], ignore_index=True)
increase_bag_of_words = count.fit_transform(increase_training["article_words"])
I_X = increase_bag_of_words
I_y = increase_training["topic"]


In [3]:
y

0       FOREX MARKETS
1       MONEY MARKETS
2              SPORTS
3       FOREX MARKETS
4          IRRELEVANT
            ...      
9495          DEFENCE
9496       IRRELEVANT
9497    FOREX MARKETS
9498       IRRELEVANT
9499    FOREX MARKETS
Name: topic, Length: 9500, dtype: object

In [4]:
#print('Features:' , count.get_feature_names())# 檢視feature names
#print('Values: \n', X.toarray())

### Defining Method
Defining different model function to be used later in the k fold validation


In [5]:
# Using specific model "method", return specifc score "score" by cross validation

def Model_Score (X, y, method, k):
    clf = method
    
    accuracy_scores = cross_val_score(clf, X, y, cv=k, scoring="accuracy")
    precision_scores = cross_val_score(clf, X, y, cv=k, scoring="precision_macro")
    recall_scores = cross_val_score(clf, X, y, cv=k, scoring="recall_macro")
    f1_scores = cross_val_score(clf, X, y, cv=k, scoring="f1_macro")
    
    return np.mean(accuracy_scores), np.mean(precision_scores), np.mean(recall_scores), np.mean(f1_scores)

#def Model_report ()
    


### Training Model (Naive Bayes)
Using K-fold validation to split the training data and validation data. Use the average score of the validation sets to evaluate the performance of the model

In [6]:
# creating 10 fold for k-fold validation
#A = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [5, 6], [7, 2], [5, 4], [7, 5], [10, 4]]) 
b = np.array([1, 2, 3, 4,5,6,7,8,9,10])

# k-fold split number 10 
k = 10

# without doing any data cleaning

bernoulliNB_accuracy, bernoulliNB_precision, bernoulliNB_recall, bernoulliNB_f1 = Model_Score(X, y, BernoulliNB(), 10)
multinomialNB_accuracy, multinomialNB_precision, multinomialNB_recall, multinomialNB_f1 = Model_Score(X, y, MultinomialNB(), 10)


# multinomialNB using uniformed distribution
multiNB_accuracy2, multiNB_precision2, multiNB_recall2, multiNB_f12 = Model_Score(X, y, MultinomialNB(fit_prior = False), 10)


# reduce irrelevant samples
R_accuracy, R_precision, R_recall, R_f1 = Model_Score(R_X, R_y, MultinomialNB(), 10)


# Icreasing the minor classes
I_accuracy, I_precision, I_recall, I_f1 = Model_Score(I_X, I_y, MultinomialNB(), 10)


# Try Classification report
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.20, random_state=42)

clf_BernoulliNB = BernoulliNB()
model_BernoulliNB = clf_BernoulliNB.fit(X_train, y_train)

clf_MultinomialNB = MultinomialNB()
model_MultinomialNB = clf_MultinomialNB.fit(X_train, y_train)


predicted_BernoulliNB = model_BernoulliNB.predict(X_valid)
predicted_MultinomialNB = model_MultinomialNB.predict(X_valid)

I_X_train, I_X_valid, I_y_train, I_y_valid = train_test_split(I_X, I_y, test_size=0.20, random_state=42)
model_MultinomialNB_2 = clf_MultinomialNB.fit(I_X_train, I_y_train)
predicted_MultinomialNB_2 = model_MultinomialNB_2.predict(I_X_valid)





# still need to deal with 


# 1. irrelevant articles

# 2. The distribution of topics are not uniform

# 3. select the features

# 4. maybe can give penalty to the misclassifying




# multinomialNB_accuracy = Model_Score(X, y, MultinomialNB(), 10, "accuracy")


  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision'

In [7]:
print("====================================================================================")
print("Without doing any data cleaning, the score of bernoulliNB,\naccuracy:  " + str(bernoulliNB_accuracy) +
     "\nprecision: " + str(bernoulliNB_precision) + "\nrecall:    " + str(bernoulliNB_recall) + "\nf1:        " +
     str(bernoulliNB_f1))
#print("\nClassification Report for bernoulliNB:\n")
#print(classification_report(y_valid, predicted_BernoulliNB))

print("====================================================================================")

print("Without doing any data cleaning, the score of multinomialNB,\naccuracy:  " + str(multinomialNB_accuracy) +
     "\nprecision: " + str(multinomialNB_precision) + "\nrecall:    " + str(multinomialNB_recall) + "\nf1:        " +
     str(multinomialNB_f1))
#print("\nClassification Report for multinomialNB:\n")
#print(classification_report(y_valid, predicted_MultinomialNB))

print("====================================================================================")

print("Setting uniformed prior, the score of multinomialNB,\naccuracy:  " + str(multiNB_accuracy2) +
     "\nprecision: " + str(multiNB_precision2) + "\nrecall:    " + str(multiNB_recall2) + "\nf1:        " +
     str(multiNB_f12))
#print("\nClassification Report for multinomialNB:\n")
#print(classification_report(y_valid, predicted_MultinomialNB))

print("====================================================================================")

print("Reduce the case of irrelevant, the score of multinomialNB,\naccuracy:  " + str(R_accuracy) +
     "\nprecision: " + str(R_precision) + "\nrecall:    " + str(R_recall) + "\nf1:        " +
     str(R_f1))

print("====================================================================================")

print("Increase(Copy) the case of topic classes, the score of multinomialNB,\naccuracy:  " + str(I_accuracy) +
     "\nprecision: " + str(I_precision) + "\nrecall:    " + str(I_recall) + "\nf1:        " +
     str(I_f1))

print("\nClassification Report for multinomialNB after Resampling Dataset:\n")
print(classification_report(I_y_valid, predicted_MultinomialNB_2))



Without doing any data cleaning, the score of bernoulliNB,
accuracy:  0.7038942167520311
precision: 0.37212369276862634
recall:    0.2819154032170111
f1:        0.28388746010825894
Without doing any data cleaning, the score of multinomialNB,
accuracy:  0.7358161832731782
precision: 0.6306736067435463
recall:    0.5526188603489018
f1:        0.5583352913640214
Setting uniformed prior, the score of multinomialNB,
accuracy:  0.7338169401702015
precision: 0.6325643464367477
recall:    0.5709569773293275
f1:        0.5712769594923721
Reduce the case of irrelevant, the score of multinomialNB,
accuracy:  0.7126879007867363
precision: 0.7162607413983089
recall:    0.6008300451444473
f1:        0.6215033433885658
Increase(Copy) the case of topic classes, the score of multinomialNB,
accuracy:  0.8046992057147376
precision: 0.8135680332176245
recall:    0.8751822879751108
f1:        0.8346417211219871

Classification Report for multinomialNB after Resampling Dataset:

                            

In [62]:
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import roc_auc_score

proba_y = model_MultinomialNB_2.predict_proba(X_valid)
print(model_MultinomialNB_2.classes_)
count = 0
for i in range(len(proba_y)):
    print(proba_y[i])
    print(predicted_MultinomialNB_2[i])

    #proba = model_BernoulliNB.classes_[np.argmax(proba_y[i])]
    proba = predicted_MultinomialNB_2[i]
    true = I_y_valid.array[i]
    print("proba ", proba)
    print("true  ", true)
    if(proba == true):
        count+=1
print(count, len(proba_y), count/len(proba_y))


['ARTS CULTURE ENTERTAINMENT' 'BIOGRAPHIES PERSONALITIES PEOPLE' 'DEFENCE'
 'DOMESTIC MARKETS' 'FOREX MARKETS' 'HEALTH' 'IRRELEVANT' 'MONEY MARKETS'
 'SCIENCE AND TECHNOLOGY' 'SHARE LISTINGS' 'SPORTS']
[4.95289135e-119 4.17801375e-092 2.30348335e-042 1.96793736e-168
 7.46004734e-008 1.62643568e-135 9.99999925e-001 1.48667582e-038
 1.05144279e-207 2.25688391e-194 2.97681721e-165]
FOREX MARKETS
proba  FOREX MARKETS
true   FOREX MARKETS
[7.05480719e-40 5.72278796e-42 1.47761474e-41 2.22835084e-35
 4.27772034e-09 9.65960904e-44 1.14334572e-12 9.99999996e-01
 1.13879457e-44 5.94513757e-32 1.12228694e-44]
FOREX MARKETS
proba  FOREX MARKETS
true   FOREX MARKETS
[1.97480581e-074 3.33431192e-086 2.78754753e-091 4.06766285e-102
 9.08034987e-116 1.12752061e-094 8.62533658e-091 2.32039091e-121
 7.84087455e-099 3.49031308e-103 1.00000000e+000]
MONEY MARKETS
proba  MONEY MARKETS
true   FOREX MARKETS
[2.10432023e-38 1.99999372e-38 1.18131612e-46 3.62258757e-47
 3.57987925e-56 2.08695647e-45 1.5357133

true   MONEY MARKETS
[3.77692156e-52 1.25211634e-43 9.66020484e-32 1.57659280e-43
 9.92939155e-01 6.73754657e-48 5.74928895e-09 7.06083929e-03
 6.10462149e-61 4.03630488e-38 6.67752618e-62]
DOMESTIC MARKETS
proba  DOMESTIC MARKETS
true   IRRELEVANT
[1.56902689e-077 2.64284694e-065 3.53082224e-081 1.31828348e-109
 3.26900850e-103 1.22131612e-090 5.08798914e-082 3.38385990e-111
 2.19798066e-088 3.49503845e-090 1.00000000e+000]
SPORTS
proba  SPORTS
true   SPORTS
[5.39095996e-083 1.72222109e-036 1.00000000e+000 6.90138547e-133
 2.32852133e-159 3.59902483e-059 2.34042415e-072 3.16027552e-178
 2.16839620e-114 4.85045359e-154 3.45544295e-109]
IRRELEVANT
proba  IRRELEVANT
true   DEFENCE
[7.02797965e-63 1.98486127e-59 9.86515823e-71 1.42362237e-72
 7.19602342e-78 2.58200471e-73 1.74453395e-57 8.03440322e-84
 1.02398344e-64 4.18679212e-60 1.00000000e+00]
SPORTS
proba  SPORTS
true   SPORTS
[2.10347663e-136 8.86486079e-122 1.26149287e-084 7.43969196e-090
 9.85635329e-121 2.23032266e-140 1.00000000

 6.06259160e-71 5.08114381e-27 1.08219957e-63]
FOREX MARKETS
proba  FOREX MARKETS
true   FOREX MARKETS
[7.34763795e-104 4.03163513e-048 3.31249691e-030 6.38725464e-153
 7.10325476e-110 1.50699351e-111 1.00000000e+000 1.75863647e-147
 1.01252993e-132 8.67222107e-138 1.27906752e-094]
DEFENCE
proba  DEFENCE
true   DEFENCE
[1.05971857e-38 1.73563860e-36 2.58243976e-33 9.10044702e-35
 5.85275762e-01 1.99632792e-38 3.47506998e-15 4.14724238e-01
 1.14498123e-44 1.22521843e-21 1.50692785e-49]
MONEY MARKETS
proba  MONEY MARKETS
true   MONEY MARKETS
[7.63471679e-46 9.30761326e-50 4.54656638e-57 9.15168155e-45
 7.28027598e-01 9.63180391e-54 7.06307065e-19 2.71972402e-01
 2.49154486e-67 4.01094128e-35 1.61757055e-46]
MONEY MARKETS
proba  MONEY MARKETS
true   MONEY MARKETS
[2.24806064e-158 3.95767147e-143 2.38248454e-138 9.35558156e-126
 8.24010438e-001 1.98681949e-129 6.16681563e-030 1.75989562e-001
 2.53109658e-191 8.70615239e-119 1.63150641e-163]
HEALTH
proba  HEALTH
true   HEALTH
[6.56619926e-4

SPORTS
proba  SPORTS
true   SPORTS
[3.36252688e-176 3.00902615e-164 1.07205259e-168 2.37043793e-136
 5.22013773e-018 1.91116678e-141 3.04772583e-057 1.00000000e+000
 1.23470693e-182 1.14918561e-127 1.76712197e-186]
SPORTS
proba  SPORTS
true   SPORTS
[6.79695988e-42 6.41729112e-38 5.69769279e-38 4.29732973e-38
 8.39578589e-01 1.14942081e-38 5.97684688e-19 1.60421411e-01
 3.03375807e-41 1.42235576e-23 3.04654985e-40]
IRRELEVANT
proba  IRRELEVANT
true   IRRELEVANT
[1.92993590e-117 3.21489333e-109 5.00145668e-113 1.63562597e-093
 8.34970708e-007 2.34900447e-099 1.69976908e-015 9.99999164e-001
 1.08567912e-121 5.71572235e-010 8.57171691e-115]
MONEY MARKETS
proba  MONEY MARKETS
true   IRRELEVANT
[9.25974319e-21 6.50160574e-17 1.16940228e-24 7.24156136e-23
 8.15502227e-33 1.57995891e-22 6.84746323e-24 1.86663982e-36
 2.34736032e-24 1.44142752e-19 1.00000000e+00]
SPORTS
proba  SPORTS
true   SPORTS
[2.13647936e-141 6.41747056e-133 2.16518610e-144 4.10287846e-064
 9.57880390e-001 3.26894055e-132

SPORTS
proba  SPORTS
true   SPORTS
[7.11186353e-12 1.29008900e-12 1.17228292e-09 2.14954431e-14
 4.45194559e-10 5.92335371e-08 9.92983032e-01 1.84312531e-09
 1.90364120e-10 7.01690473e-03 1.11337758e-17]
FOREX MARKETS
proba  FOREX MARKETS
true   FOREX MARKETS
[5.16038725e-92 3.84895610e-94 8.98376749e-81 3.55707723e-98
 1.00000000e+00 3.21359232e-87 7.14428823e-40 1.62496094e-11
 3.06616418e-96 1.03153492e-73 2.40257381e-80]
DOMESTIC MARKETS
proba  DOMESTIC MARKETS
true   IRRELEVANT
[5.96155563e-15 2.62514746e-14 4.28563297e-08 2.06724878e-35
 2.49120602e-48 2.81060644e-11 9.99999957e-01 1.53027901e-59
 2.48744165e-19 3.73091544e-37 4.89775664e-15]
FOREX MARKETS
proba  FOREX MARKETS
true   MONEY MARKETS
[3.04171929e-56 3.03899661e-61 3.00683990e-60 1.45850839e-29
 5.73914398e-27 6.54278633e-50 1.00000000e+00 2.25752401e-29
 8.27971067e-73 1.17913196e-14 1.88389970e-69]
IRRELEVANT
proba  IRRELEVANT
true   IRRELEVANT
[4.21422870e-061 6.66207592e-059 2.55068263e-073 8.43448815e-076
 9.983

FOREX MARKETS


### Final Test Try

### Training Model (Decision Tree)
Using K-fold validation to split the training data and validation data. Use the average score of the validation sets to evaluate the performance of the model


In [32]:
decisionTree_accuracy, decisionTree_precision, decisionTree_recall, decisionTree_f1 = Model_Score(X, y, DecisionTreeClassifier(), 10)


clf_decisionTree = DecisionTreeClassifier()
model_decisionTree = clf_decisionTree.fit(X_train, y_train)

predicted_decisionTree = model_decisionTree.predict(X_valid)






In [33]:
print("\n====================================================================================\\n")
print("Without doing any data cleaning, the score of decisionTree,\naccuracy:  " + str(decisionTree_accuracy) +
     "\nprecision: " + str(decisionTree_precision) + "\nrecall:    " + str(decisionTree_recall) + "\nf1:        " +
     str(decisionTree_f1))
print("\nClassification Report for decisionTree:\n")
print(classification_report(y_valid, predicted_decisionTree))


Without doing any data cleaning, the score of decisionTree,
accuracy:  0.6996523800897622
precision: 0.5381255307309976
recall:    0.49843501674178625
f1:        0.5162767074832546

Classification Report for decisionTree:

                                  precision    recall  f1-score   support

      ARTS CULTURE ENTERTAINMENT       0.64      0.32      0.42        22
BIOGRAPHIES PERSONALITIES PEOPLE       0.31      0.21      0.25        39
                         DEFENCE       0.56      0.45      0.50        44
                DOMESTIC MARKETS       0.60      0.56      0.58        27
                   FOREX MARKETS       0.40      0.40      0.40       174
                          HEALTH       0.47      0.35      0.40        49
                      IRRELEVANT       0.78      0.82      0.80       909
                   MONEY MARKETS       0.59      0.60      0.60       344
          SCIENCE AND TECHNOLOGY       0.45      0.28      0.34        18
                  SHARE LISTINGS   

In [9]:
collections.Counter(y)

Counter({'FOREX MARKETS': 845,
         'MONEY MARKETS': 1673,
         'SPORTS': 1102,
         'IRRELEVANT': 4734,
         'SHARE LISTINGS': 218,
         'BIOGRAPHIES PERSONALITIES PEOPLE': 167,
         'DOMESTIC MARKETS': 133,
         'DEFENCE': 258,
         'SCIENCE AND TECHNOLOGY': 70,
         'HEALTH': 183,
         'ARTS CULTURE ENTERTAINMENT': 117})