<a href="https://colab.research.google.com/github/michalis0/DataScience_and_MachineLearning/blob/master/Assignements/Part%205/Assignment_part_five.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

DSML investigation:

You are part of the Suisse Impossible Mission Force, or SIMF for short. You need to uncover a rogue agent who is trying to steal sensitive information.

Your mission, should you choose to accept it, is to find that agent before they steal any classified information. Good luck!

# Assignment part five

### Due 29.10 (You get an extra hour!)

By now you should have 4 suspects left.
More information came in suggesting that the rogue agent is tampering with the SIMF's sentiment annotation system, which analyses news documents and marks their sentiment for intelligence analysis tasks.

This annotation is crucial for identifying documents expressing negativity towards Switzerland and its allies.

Each document contains a column which shows which user accessed it. We know that the rogue agent accessed only documents whose sentiment was strongly negative and was then changed to positive or neutral. We will use a Hugging Face model to identify which records have been tampered with.


[You can find more models on this link](https://huggingface.co/models?sort=trending)


In [1]:
%%capture
!pip install datasets transformers huggingface_hub
!apt-get install git-lfs
!pip install transformers[torch]
!pip install accelerate -U
# Import required packages

# library for huggingface
from transformers import pipeline, DataCollatorWithPadding

# other libraries
import pandas as pd
import numpy as np
import torch
import spacy
from sklearn.model_selection import train_test_split

torch.cuda.is_available()


# 1. Getting to know our data

In [2]:
df = pd.read_excel('https://raw.githubusercontent.com/michalis0/DataScience_and_MachineLearning/master/Assignements/Part%205/data/Reduced_Set_2100.xlsx')

In [3]:
df.head(5)

Unnamed: 0,company,title,news,evaluation,year,month,day
0,APPLE,Tourists snap up British iPads to smuggle into...,IT'S the digital version of the slow boat to C...,negative,2011,4,17
1,CHEVRON,AFTER SEATTLE; Anarchists get organized.,"For Juliette Beck, it began with the story of ...",negative,2000,4,17
2,Exxon Mobil,$10bn oil payout voided,SAN FRANCISCO: An appeal court yesterday voide...,negative,2001,11,9
3,WAL MART STORES,Craft capitalism: Just do it yourself; Web mar...,The declaration from the Handmade Consortium m...,negative,2007,12,15
4,Exxon Mobil,Chevron gas project gets state green light,SYDNEY: Chevron has received final environment...,negative,2007,9,8


In [4]:
df.dtypes

company       object
title         object
news          object
evaluation    object
year           int64
month          int64
day           object
dtype: object

In [5]:
df.shape

(2100, 7)

In [6]:
df["evaluation"].value_counts()

negative    700
neutral     700
positive    700
Name: evaluation, dtype: int64

# 2. Re-evaluating with SIMF's model:
Evaluate the sentiment of the `title` column using a sentiment pipeline built on the `finiteautomata/bertweet-base-sentiment-analysis` model.


_This may take a while_

In [7]:
# initialisation of the sentiment analysis pipeline based on the model bertweet :

sentiment_pipeline =  pipeline('sentiment-analysis', model='finiteautomata/bertweet-base-sentiment-analysis')


In [8]:
# testing the sentiment model to know it better :

test_pos = sentiment_pipeline("I love this course even if the assignment is sometimes not so clear")
test_neg = sentiment_pipeline("I hate this course")
test_mid = sentiment_pipeline("I don't know what to think about this course but i may be surprised")

print(test_pos)
print(test_neg)
print(test_mid)

[{'label': 'POS', 'score': 0.991771399974823}]
[{'label': 'NEG', 'score': 0.9813860058784485}]
[{'label': 'NEU', 'score': 0.7201648354530334}]


In [9]:
# just to do it in a cleaner way
def longer_sentiment(label):
    if label == 'POS':
        return 'positive'
    elif label == 'NEG':
        return 'negative'
    elif label == 'NEU':
        return 'neutral'
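
The `if`/`elif` chain above silently returns `None` for any label it does not recognise. As a minimal sketch of an alternative, a dictionary lookup keeps the mapping in one place and fails loudly on an unexpected label. The names `LABEL_NAMES` and `to_long_label` are hypothetical and not part of the assignment.

```python
# Hypothetical alternative to longer_sentiment: a lookup table plus an explicit error.
LABEL_NAMES = {"POS": "positive", "NEG": "negative", "NEU": "neutral"}

def to_long_label(label):
    """Map a short pipeline label to the long form used in the 'evaluation' column."""
    try:
        return LABEL_NAMES[label]
    except KeyError:
        # Raise instead of returning None so a new or misspelled label is noticed.
        raise ValueError(f"unexpected sentiment label: {label!r}") from None

print(to_long_label("NEU"))
```

Raising on unknown labels is a design choice: it avoids `None` values slipping into the `new_evaluation` column unnoticed.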

In [10]:
# evaluating the sentiment on the title column using the sentiment pipeline :
new_evaluation = []

for title in df['title']:
    result = sentiment_pipeline(title)
    cleaner_result = longer_sentiment(result[0]['label']) # using the function that i created above to clean my data :)
    new_evaluation.append(cleaner_result)

# adding the new_eval next to the current evaluations
df.insert(df.columns.get_loc("evaluation") + 1, "new_evaluation", new_evaluation)

In [11]:
# display the new dataframe :

df.head()

Unnamed: 0,company,title,news,evaluation,new_evaluation,year,month,day
0,APPLE,Tourists snap up British iPads to smuggle into...,IT'S the digital version of the slow boat to C...,negative,neutral,2011,4,17
1,CHEVRON,AFTER SEATTLE; Anarchists get organized.,"For Juliette Beck, it began with the story of ...",negative,neutral,2000,4,17
2,Exxon Mobil,$10bn oil payout voided,SAN FRANCISCO: An appeal court yesterday voide...,negative,negative,2001,11,9
3,WAL MART STORES,Craft capitalism: Just do it yourself; Web mar...,The declaration from the Handmade Consortium m...,negative,positive,2007,12,15
4,Exxon Mobil,Chevron gas project gets state green light,SYDNEY: Chevron has received final environment...,negative,neutral,2007,9,8


In [12]:
# check how many values of each there are now (for new_evaluation)

df["new_evaluation"].value_counts()

neutral     1380
negative     384
positive     336
Name: new_evaluation, dtype: int64

## 2.1 How many of the total entries match in both the SIMF model **and** the Hugging Face model?

In [13]:
# we will use the eq() function to answer this question :

is_equal = df['evaluation'].eq(df['new_evaluation'])
matching_table = df[is_equal]
matching_df = matching_table.copy()
matching_entries = is_equal.sum()

print("Number of entries that match : ", matching_entries)
print("As a fraction : ", matching_entries / len(df)) # 913 / 2100

Number of entries that match :  913
As a fraction :  0.43476190476190474
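
Beyond the single matching count, a per-label breakdown can show where the two annotations disagree. The sketch below uses a small made-up frame (`toy` and its rows are hypothetical, not taken from the assignment file) to compute the agreement rate and a cross-tabulation with `pd.crosstab`.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'evaluation' and 'new_evaluation' columns.
toy = pd.DataFrame({
    "evaluation":     ["negative", "negative", "neutral", "positive", "positive", "neutral"],
    "new_evaluation": ["negative", "neutral", "negative", "positive", "neutral", "neutral"],
})

# Share of rows where both annotations agree (mean of a boolean Series).
agreement = (toy["evaluation"] == toy["new_evaluation"]).mean()

# Rows: stored annotation, columns: model prediction.
table = pd.crosstab(toy["evaluation"], toy["new_evaluation"])

print(f"agreement: {agreement:.1%}")
print(table)
```

On the real data, the cell for stored `neutral` or `positive` against predicted `negative` would correspond to the entries examined in section 2.3.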


## 2.2 We will now focus on the entries that do not match
#### Identify all non matching entries

In [14]:
# identifying all the non matching entries
does_not_match = df['evaluation'] != df['new_evaluation']

# creating a dataframe with only the non matching entries
non_matching_table = df[does_not_match]
non_matching_df = non_matching_table.copy()

# displaying the new df
print("Number of non matching entries :", len(non_matching_df))
non_matching_df.head()

Number of non matching entries : 1187


Unnamed: 0,company,title,news,evaluation,new_evaluation,year,month,day
0,APPLE,Tourists snap up British iPads to smuggle into...,IT'S the digital version of the slow boat to C...,negative,neutral,2011,4,17
1,CHEVRON,AFTER SEATTLE; Anarchists get organized.,"For Juliette Beck, it began with the story of ...",negative,neutral,2000,4,17
3,WAL MART STORES,Craft capitalism: Just do it yourself; Web mar...,The declaration from the Handmade Consortium m...,negative,positive,2007,12,15
4,Exxon Mobil,Chevron gas project gets state green light,SYDNEY: Chevron has received final environment...,negative,neutral,2007,9,8
5,CHEVRON,RIVALS WILL ACCUSE BP,[…] BP was expected to be hung out to dry by r...,negative,neutral,2010,6,15


## 2.3 How many of those entries that our model predicted as negative are evaluated as neutral or positive by the SIMF model?

Store the resulting dataframe into a new one that we will be using in the following questions.

In [15]:
# selecting all the negative values predicted in our model (new_evaluation) that have a corresponding positive/neutral evaluation
selected = (non_matching_df['new_evaluation'] == 'negative') & ((non_matching_df['evaluation'] == 'positive') | (non_matching_df['evaluation'] == 'neutral'))

sel = non_matching_df[selected]
altered_df = sel.copy()

print("Number of positive/neutral values that have been predicted as negative in our model : ", len(altered_df))

Number of positive/neutral values that have been predicted as negative in our model :  171
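
The same selection can be written a little more compactly with `Series.isin`. This is only a sketch on made-up rows (the `toy` frame and its titles are hypothetical); it is meant to show the shape of the mask, not to replace the cell above.

```python
import pandas as pd

# Hypothetical toy rows: stored annotation vs. model prediction.
toy = pd.DataFrame({
    "title": ["t1", "t2", "t3", "t4"],
    "evaluation": ["neutral", "positive", "negative", "neutral"],
    "new_evaluation": ["negative", "negative", "negative", "neutral"],
})

# Model says negative, stored annotation says positive or neutral.
mask = toy["new_evaluation"].eq("negative") & toy["evaluation"].isin(["positive", "neutral"])

# .copy() gives an independent frame, so later edits do not touch 'toy'.
flagged = toy.loc[mask].copy()
print(flagged["title"].tolist())
```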


In [16]:
# the new dataframe is altered_df and looks like this :
print(altered_df.shape)
display(altered_df)

(171, 8)


Unnamed: 0,company,title,news,evaluation,new_evaluation,year,month,day
702,BANK OF AMERICA,Letters: Anger simmers over energy bills; Thes...,"[…] Ending our addiction to fossil fuels, slas...",neutral,negative,2013,10,17
715,VERIZON COMMUNICATIONS,So How Contaminated Is the Old Nuclear Plant?,"[…] From 1952 to 1966, when operations ceased ...",neutral,negative,2002,1,13
722,CITIGROUP,Front: Carbon bubble 'creates global economic ...,The world could be heading for a major economi...,neutral,negative,2013,4,19
733,CITIGROUP,Report says officials are shortsighted big spe...,Government officials are guilty of wasting mon...,neutral,negative,2003,11,27
738,CITIGROUP,Miners to bear cost of climate toll,"RESOURCE companies, along with transporter Tol...",neutral,negative,2006,12,11
...,...,...,...,...,...,...,...,...
2032,JOHNSON & JOHNSON,Energy: Internet hoax raises pressure over emi...,Environmentalists opened up a new front for cl...,positive,negative,2007,12,4
2034,MERCK & COMPANY,TRENDS RAIN-FOREST CHIC Maybe Ben & Jerry'...,[…] Roddick told the Mexican conclave of big-b...,positive,negative,1995,9,29
2037,WAL MART STORES,Vancouver snub 'disappoints' Wal-Mart,VANCOUVER - A day after Vancouver City Council...,positive,negative,2005,6,30
2053,GENERAL ELECTRIC,Climate group seeks legal action over hoax web...,"The US Climate Action Partnership, a broad coa...",positive,negative,2007,12,6


# 3. Use the ChangeLog dataframe to identify the user IDs that edited the tampered entries, and only the altered entries

In [17]:
ChangeLog = pd.read_csv('https://raw.githubusercontent.com/michalis0/DataScience_and_MachineLearning/master/Assignements/Part%205/data/ChangeLog.csv')

In [18]:
# knowing the data
ChangeLog.head()

Unnamed: 0,UserID,title
0,[327047],Tourists snap up British iPads to smuggle into...
1,[401818],Tourists snap up British iPads to smuggle into...
2,[564061],Tourists snap up British iPads to smuggle into...
3,[446376],Tourists snap up British iPads to smuggle into...
4,[242912],AFTER SEATTLE; Anarchists get organized.


In [19]:
ChangeLog.shape

(4169, 2)

## 3.1 Identifying the users who have edited tampered documents

In [20]:
sus_edited_tampered = non_matching_df.merge(ChangeLog, on='title', how='inner')

# just checking how many users we found and how many unique users:
suspects_list = sus_edited_tampered['UserID']
unique_sl_count = suspects_list.nunique()

print("Number of userID found : ", len(suspects_list))
print("Number of unique users : ", unique_sl_count)

Number of userID found :  2481
Number of unique users :  682
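
The row count after the merge is larger than the number of documents because an inner merge produces one row per (document, log entry) pair that shares a title. A minimal sketch with made-up frames (`flagged`, `log` and their values are hypothetical) illustrates this:

```python
import pandas as pd

# Hypothetical flagged documents and change log.
flagged = pd.DataFrame({"title": ["A", "B"], "evaluation": ["neutral", "positive"]})
log = pd.DataFrame({
    "UserID": ["[1]", "[2]", "[1]", "[3]"],
    "title":  ["A",   "A",   "B",   "C"],
})

# Inner merge: keeps only titles present in both frames, one row per matching pair.
merged = flagged.merge(log, on="title", how="inner")

print(len(merged))                 # number of (document, edit) pairs
print(merged["UserID"].nunique())  # number of distinct editors
```

Because the join key is the title, two different documents that happened to share a title would be matched to each other's log entries; whether that occurs in the assignment data is not checked here.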


## 3.2 Identifying the users who have edited non-tampered documents

In [21]:
sus_edited_nontampered = matching_df.merge(ChangeLog, on='title', how='inner')

# just checking how many users we found and how many unique users:
non_suspects_list = sus_edited_nontampered['UserID']
unique_nsl_count = non_suspects_list.nunique()

print("Number of userID found: ", len(non_suspects_list))
print("Number of unique users : ", unique_nsl_count)

Number of userID found:  1803
Number of unique users :  649


## 3.3 Combining the results from `3.1` and `3.2` to identify users who only edited tampered documents.
These are our suspects.

In [22]:
suspects_set = set(suspects_list)
non_suspects_set = set(non_suspects_list)

cleaned_suspects_set = suspects_set - non_suspects_set
len(cleaned_suspects_set)

60
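
The set difference used above keeps exactly those users who appear in the first set and not in the second. A small sketch with made-up IDs (all values are hypothetical):

```python
# Hypothetical editors of tampered and of non-tampered documents.
tampered_editors = {"[1]", "[2]", "[3]"}
clean_editors = {"[2]", "[4]"}

# Users who edited tampered documents and never a non-tampered one.
only_tampered = tampered_editors - clean_editors

print(sorted(only_tampered))
```

User `[2]` is dropped because they also edited a non-tampered document, and `[4]` never appears because they edited no tampered document.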

In [23]:
# answering the question on Moodle :

# UserID values are strings such as "[327047]": strip the surrounding brackets and convert to int
real_suspects_list = [int(suspect[1:-1]) for suspect in cleaned_suspects_set]

for suspect in [241540, 754702, 527013, 223968, 152304] :
  if suspect in real_suspects_list :
    print(suspect)

754702
223968
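
The conversion above relies on every `UserID` being a string of the form `"[327047]"`. As a hedged sketch, a small helper (the name `parse_user_id` is hypothetical) can tolerate stray whitespace, and a set intersection can replace the loop over candidates. The sample values below are taken from the outputs shown in this notebook.

```python
def parse_user_id(raw):
    """Turn a string such as '[327047]' into the integer 327047."""
    return int(str(raw).strip().strip("[]"))

# A few raw IDs in the ChangeLog format, including one with stray spaces.
raw_suspects = {"[754702]", "[223968]", " [96249] "}
parsed = {parse_user_id(r) for r in raw_suspects}

# Candidates from the quiz that are also in the parsed suspect set.
candidates = [241540, 754702, 527013, 223968, 152304]
print(sorted(set(candidates) & parsed))
```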


# 4. Identifying important information on the altered documents.

In this section we will use the TF-IDF text representation model to identify other important information on the altered documents.

----
[note to myself]


Altered document = an entry whose two evaluations do not match and whose sentiment has been changed from negative to positive/neutral

In [24]:
# Make a list of the text within articles with the original dataset (the one of section 1)
articles_list = df['news'].tolist()

# checking if the list of articles has been correctly implemented
display(articles_list[0])

print("\n The number of articles is : ", len(articles_list))

"IT'S the digital version of the slow boat to China. The iPad 2, made in China but yet to go on sale there, is being bought in London and smuggled back into the country where it is manufactured.  It means the Apple tablet is creating extra carbon emissions as it travels halfway across the world and back again. Customers are queuing to buy the iPad 2 at Apple shops in London and send them 6,000 miles back to China. The tablets are bought for their retail price of £399 for a 16GB model and then taken to China, where they sell for £430. The profit is not in this small mark-up but in the 20% Vat - in this case almost £80 - Chinese tourists can reclaim when they take the iPads out of the country. No tax is paid to the Chinese authorities by the smugglers. It means a student who takes home six iPads can fund his trip to London. […]"


 The number of articles is :  2100


In [25]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Using default tokenizer in TfidfVectorizer, use the "english" stop words, and unigrams
default_tokenizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 1))

# Learn the vocabulary dictionary and return document-term matrix
tfidf_matrix = default_tokenizer.fit_transform(articles_list)

# Visualize result in dataframe
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=default_tokenizer.get_feature_names_out())

display(tfidf_df)

# just to make sure that there are not only zero-values as the table is really big :
non_zero_count = tfidf_matrix.nnz
print("Number of non-zero values in the TF-IDF matrix :", non_zero_count)

Unnamed: 0,00,000,000b,000billion,000kg,000km,000kwh,000mt,000mwh,000sqkm,...,zoning,zoo,zoologist,zorigt,zoé,zse,zucker,zuckerberg,zune,zurich
0,0.0,0.054857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2095,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2096,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2097,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2098,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Number of non-zero values in the TF-IDF matrix : 179039


In [29]:
# Keep the entries related to tampered documents (in my case : altered_df)
altered_articles_list = altered_df['news'].tolist()

tfidf_altered_matrix = default_tokenizer.transform(altered_articles_list) # here i will only transform, as i don't want to fit the vectorizer again.

# getting the features names
feature_names = default_tokenizer.get_feature_names_out()

# Identify the record that stands out the most on the altered documents (you can use the sum of the tokenizers results)
sum_word_scores = tfidf_altered_matrix.sum(axis=0)
sum_word_scores_df = pd.DataFrame(sum_word_scores, columns=feature_names)

# displaying the 4 most important words :
top_words = sum_word_scores_df.iloc[0].sort_values(ascending=False).head(4)

print("Here are the top 4 words of all the altered texts\n", top_words)

Here are the top 4 words of all the altered texts
 carbon     5.759377
said       5.589130
energy     5.010363
climate    4.983437
Name: 0, dtype: float64


In [30]:
# How many records contain the word that stands out the most?
# e.g. if the word that stood out the most was "mouton", how many of the altered records contain the word mouton.
# How about the second word that stands out the most
# How about the third ?
# How about the fourth ?

top_words = top_words.index

# dictionary init to count the number of occurrences
word_counts = {word: 0 for word in top_words}

# for each word, checking and counting how many times it appears in the articles :
for article in altered_articles_list:
    for word in top_words:
        if word in article:
            word_counts[word] += 1

# displaying the number of times that a word appears
for word, count in word_counts.items():
    print("the word ", word, " is found in ", count ," altered articles")


the word  carbon  is found in  56  altered articles
the word  said  is found in  97  altered articles
the word  energy  is found in  52  altered articles
the word  climate  is found in  41  altered articles
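
Note that `word in article` is a case-sensitive substring test, whereas `TfidfVectorizer` lowercases the text and, with its default `token_pattern`, splits it into tokens of two or more word characters. The two ways of counting may therefore disagree. A minimal sketch on made-up sentences (the helper names and the `toy_articles` list are hypothetical):

```python
import re

# Same pattern as TfidfVectorizer's default token_pattern.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")

def substring_doc_count(word, articles):
    """Count articles containing `word` as a raw, case-sensitive substring."""
    return sum(word in article for article in articles)

def token_doc_count(word, articles):
    """Count articles containing `word` as a whole lowercased token."""
    return sum(word in set(TOKEN_RE.findall(article.lower())) for article in articles)

toy_articles = [
    "Carbon prices rose.",
    "The aforesaid carbonate deposit.",
    "She said taxes work.",
    "CARBON markets fell.",
]

for w in ("carbon", "said"):
    print(w, substring_doc_count(w, toy_articles), token_doc_count(w, toy_articles))
```

Here the substring test misses "Carbon" and "CARBON" but matches "carbonate" and "aforesaid", while the token-based count does the opposite.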


#### Moodle quiz: if the order of the words by number of appearances does not match the order of the values assigned by the tokenizer, is that normal?

My answer: yes, it is normal. TF-IDF weights how often a word appears in a document by how rare it is across the whole corpus, so a rarer word can get a higher score than a word that appears in more documents. For example, "carbon" is quite specific, so it is normal that it scores higher than "said", which is a very common word found almost everywhere.
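
To make this concrete, here is a small from-scratch sketch of TF-IDF using the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each document, which matches the documented defaults of scikit-learn's `TfidfVectorizer`. The corpus, the whitespace tokenisation, and the name `tfidf_rows` are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict

def tfidf_rows(docs):
    """Return one {term: weight} dict per document, plus the document frequencies."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for tokens in tokenised:
        raw = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({t: w / norm for t, w in raw.items()})
    return rows, doc_freq

# Hypothetical toy corpus: "said" is in more documents, "carbon" is more concentrated.
toy_docs = [
    "carbon carbon carbon said",
    "carbon carbon tax",
    "said bank profit loan",
    "said minister budget vote",
]

rows, doc_freq = tfidf_rows(toy_docs)

# Sum each term's weight over all documents, as done for the altered articles.
summed = defaultdict(float)
for row in rows:
    for term, weight in row.items():
        summed[term] += weight

for term in ("carbon", "said"):
    print(term, "documents:", doc_freq[term], "summed score:", round(summed[term], 3))
```

In this toy corpus "said" occurs in more documents than "carbon", yet "carbon" ends up with the larger summed score, which is the same kind of mismatch observed above.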

<h1> Which users have been suspects for all parts so far? </h1>
Choose those that apply.

You may find on moodle a table summing up the potential suspects thus far.


In [31]:
print(real_suspects_list)

for suspect in [410319, 785994, 638911, 628854, 793674] :
  if suspect in real_suspects_list :
    print(suspect)

# then compare them with the excel file and check if they appear in all parts :)

[263704, 946059, 402663, 411464, 886132, 136015, 156304, 365406, 943792, 942571, 825733, 793674, 645054, 455352, 242361, 442127, 543239, 703326, 795804, 458293, 165305, 108215, 409012, 223968, 200865, 544861, 150642, 628233, 167822, 339524, 376743, 410319, 564884, 676003, 883252, 131191, 902629, 743377, 621836, 541833, 306277, 131393, 355972, 706286, 681209, 267733, 700708, 317991, 173906, 539227, 745000, 387404, 817857, 96249, 785994, 183438, 765508, 628854, 754702, 261521]
410319
785994
628854
793674
