# **CMT309 - Computational Data Science - Data Science Portfolio**

# Part 1 - Text Data Analysis (45 marks)

In this question you will write Python code for processing, analyzing and understanding the social network **Reddit** (www.reddit.com). Reddit is a platform that allows users to upload posts and comment on them, and is divided into _subreddits_, often covering specific themes or areas of interest (for example, [world news](https://www.reddit.com/r/worldnews/), [ukpolitics](https://www.reddit.com/r/ukpolitics/) or [nintendo](https://www.reddit.com/r/nintendo)). You are provided with a subset of Reddit with posts from Covid-related subreddits (e.g., _CoronavirusUK_ or _NoNewNormal_), as well as randomly selected subreddits (e.g., _donaldtrump_ or _razer_).

The `csv` dataset you are provided contains one row per post and has information about three entities: **posts**, **users** and **subreddits**. The column names are self-explanatory: columns starting with the prefix `user_` describe users, those starting with the prefix `subr_` describe subreddits, the `subreddit` column is the subreddit name, and the remaining columns are post attributes (`author`, `posted_at`, `title`, the post text in `selftext`, the number of comments in `num_comments`, `score`, etc.).
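
As a quick illustration of this naming convention, the sketch below sorts column names into the three entities by prefix. The column list is copied from the dataset header displayed by `df.head()`; the helper name `split_columns_by_entity` is an assumption made for this example and is not part of the assignment.

```python
def split_columns_by_entity(columns):
    """Group column names into user, subreddit and post attributes by prefix."""
    groups = {"user": [], "subreddit": [], "post": []}
    for col in columns:
        if col.startswith("user_"):
            groups["user"].append(col)
        elif col.startswith("subr_") or col == "subreddit":
            # `subreddit` holds the subreddit name, so it is grouped with the subr_ columns
            groups["subreddit"].append(col)
        else:
            groups["post"].append(col)
    return groups


columns = ["author", "posted_at", "num_comments", "score", "selftext",
           "subr_created_at", "subr_description", "subr_faved_by",
           "subr_numb_members", "subr_numb_posts", "subreddit", "title",
           "total_awards_received", "upvote_ratio", "user_num_posts",
           "user_registered_at", "user_upvote_ratio"]

groups = split_columns_by_entity(columns)
print(groups["user"])
```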

In this exercise, you are asked to perform a number of operations to gain insights from the data.

## P1.0) Suggested/Required Imports

In [None]:
# suggested imports
import pandas as pd
from nltk.tag import pos_tag
import re
from collections import defaultdict,Counter
from nltk.stem import WordNetLemmatizer
from datetime import datetime
from tqdm import tqdm
import numpy as np
import os
tqdm.pandas()
from ast import literal_eval
# nltk imports, note that these outputs may be different if you are using colab or local jupyter notebooks
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [None]:
from urllib import request
import pandas as pd

module_url = "https://raw.githubusercontent.com/luisespinosaanke/cmt309-portfolio/master/data_portfolio_22.csv"
module_name = module_url.split('/')[-1]
print(f'Fetching {module_url}')
# download the csv file and save it locally under the same file name
with request.urlopen(module_url) as f, open(module_name, 'w', encoding='utf-8') as outf:
    a = f.read()
    outf.write(a.decode('utf-8'))
df = pd.read_csv(module_name)
# this fills empty cells with empty strings
df = df.fillna('')

Fetching https://raw.githubusercontent.com/luisespinosaanke/cmt309-portfolio/master/data_portfolio_22.csv


In [None]:
df.head()

Unnamed: 0,author,posted_at,num_comments,score,selftext,subr_created_at,subr_description,subr_faved_by,subr_numb_members,subr_numb_posts,subreddit,title,total_awards_received,upvote_ratio,user_num_posts,user_registered_at,user_upvote_ratio
0,-Howitzer-,2020-08-17 20:26:04,19,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,BREAKING: Trump to begin hiding in mailboxes t...,0,1.0,4661,2012-11-09,-0.658599
1,-Howitzer-,2020-07-06 17:01:48,1,3,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,Joe Biden's America,0,0.67,4661,2012-11-09,-0.658599
2,-Howitzer-,2020-09-09 02:29:02,3,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,4 more years and we can erase his legacy for g...,0,1.0,4661,2012-11-09,-0.658599
3,-Howitzer-,2020-06-23 23:02:39,2,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,Revelation 9:6 [Transhumanism: The New Religio...,0,1.0,4661,2012-11-09,-0.658599
4,-Howitzer-,2020-08-07 04:13:53,32,622,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,"LOOK HERE, FAT",0,0.88,4661,2012-11-09,-0.658599


## P1.1 - Text data processing (20 marks)

### P1.1.1 - Offensive authors per subreddit (5 marks)

As you will see, the dataset contains many strings of the form `[***]`. These mask (or remove) swearwords to make the dataset less offensive. We are interested in finding, for each subreddit, the users who have posted at least one swearword in it. We do this by counting occurrences of the `[***]` string in the `selftext` column (we can assume that each occurrence of `[***]` corresponds to a swearword in the original dataset).

**What to implement:** A function `offensive_authors(df)` that takes as input the original dataframe and returns a dataframe of the form below, where each row contains authors that posted at least one swearword in the corresponding subreddit.

```
subreddit	author
0	40kLore	Cross_Ange
1	40kLore	DaRandomGitty2
2	40kLore	EMB1981
3	40kLore	Evoxrus_XV
4	40kLore	Grtrshop
...
```
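
One detail worth checking before filtering is how the mask string is matched. The sketch below uses only the standard library to show that `[***]`, read as a regular expression, is a character class matching a single asterisk, not the literal five-character mask. The helper name `count_swearwords` and the sample strings are assumptions made for this example.

```python
import re

MASK = "[***]"


def count_swearwords(text):
    """Count literal occurrences of the mask string in a post body."""
    return text.count(MASK)


# Used as a regular expression, "[***]" is a character class that matches a
# single asterisk, so it also flags text that merely contains "*".
naive_hit = re.search(MASK, "2 * 3 = 6") is not None

# Escaping the mask makes the regex match only the literal string "[***]".
escaped_hit = re.search(re.escape(MASK), "2 * 3 = 6") is not None

print(count_swearwords("what the [***] is this [***]"))  # 2
print(naive_hit, escaped_hit)  # True False
```

In pandas, `Series.str.contains` treats its pattern as a regular expression by default, so the same care applies there: either escape the mask or pass `regex=False`.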

In [None]:
def offensive_authors(df):
    # Flag the posts whose selftext contains the literal mask "[***]"
    has_swearword = df["selftext"].str.contains("[***]", regex=False)

    # Keep one row per (subreddit, author) pair with at least one flagged post,
    # sorted by subreddit and author to match the requested output format
    result = (
        df.loc[has_swearword, ["subreddit", "author"]]
        .drop_duplicates()
        .sort_values(["subreddit", "author"])
        .reset_index(drop=True)
    )
    return result


In [None]:
offensive_authors(df)

Unnamed: 0,subreddit,author
0,donaldtrump,-Howitzer-
1,donaldtrump,-mylankovic-
2,donaldtrump,0ldManFrank
3,donaldtrump,11_gop
4,donaldtrump,AddictedReddit
...,...,...
2792,TheVampireDiaries,eli454
2793,TheVampireDiaries,kay278
2794,TheVampireDiaries,maverickbluezero
2795,criminalminds,eli454


### P1.1.2 - Most common trigrams per subreddit (15 marks)

We are interested in learning about _the ten most frequent trigrams_ (a [trigram](https://en.wikipedia.org/wiki/Trigram) is a sequence of three consecutive words) in each subreddit's content. You must compute these trigrams on both the `selftext` and `title` columns. Your task is to generate a Python dictionary of the form:

```
{subreddit1: [(trigram1, freq1), (trigram2, freq2), ... , (trigram10, freq10)],
subreddit2: [(trigram1, freq1), (trigram2, freq2), ... , (trigram10, freq10)],
...
subreddit63: [(trigram1, freq1), (trigram2, freq2), ... , (trigram10, freq10)],}
```

That is, for each subreddit, the 10 most frequent trigrams and their frequency, stored in a list of tuples. Each trigram will be stored also as a tuple containing 3 strings.

**What to implement**: A function `get_tris(df, stopwords_list, punctuation_list)` that will take as input the original dataframe, a list of stopwords and a list of punctuation signs (e.g., `?` or `!`), and will return a python dictionary with the above format. Your function must implement the following steps in order:

- (**1 mark**) Create a new dataframe called `newdf` with only `subreddit`, `title` and `selftext` columns.
- (**1 mark**) Add a new column to `newdf` called `full_text`, which will contain `title` and `selftext` concatenated with the string `.` (a full stop) followed by a space. That is, `A simple title` and `This is a text body` would be `A simple title. This is a text body`.
- (**1 mark**) Remove all occurrences of the following strings from `full_text`. You must do this without creating a new column:
  - `[***]`
  - `&amp;`
  - `&gt;`
  - `https`
- (**1 mark**) You must also remove all occurrences of at least three consecutive hyphens, for example, you should remove strings like `---`, `----`, `-----`, etc., but not `--` and not `-`.
- (**1 mark**) Tokenize the contents of the `full_text` column after lower casing (removing all capitalization). You should use the `word_tokenize` function in `nltk`. Add the results to a new column called `full_text_tokenized`.
- (**2 marks**) Remove all tokens that are either stopwords or punctuation from `full_text_tokenized` and store the results in a new column called `full_text_tokenized_clean`. _See Note 1_.
- (**2 marks**) Create a new dataframe called `adf` (which will stand for _aggregated dataframe_), which will have one row per subreddit (i.e., 63 rows), and will have two columns: `subreddit` (the subreddit name), and `all_words`, which will be a big list with all the words that belong to that subreddit as extracted from the `full_text_tokenized_clean`.
- (**3 marks**) Obtain trigram counts, which will be stored in a dictionary where each `key` will be a trigram (a `tuple` containing 3 consecutive tokens), and each `value` will be its overall frequency in that subreddit. You are encouraged to use functions from the `nltk` package, although you can choose any approach to solve this part.
- (**3 marks**) Finally, use the information you have in `adf` for generating the desired dictionary, and return it. _See Note 2_.
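
The two removal steps in the list above can be sketched with a single regular expression. This is a minimal illustration using the standard `re` module on plain strings; the names `CLEAN_PATTERN` and `clean_text` are assumptions made for this example, and the pattern covers only the strings listed in the task.

```python
import re

# Literal "[***]", "&amp;", "&gt;", "https", or a run of three or more hyphens
CLEAN_PATTERN = re.compile(r"\[\*\*\*\]|&amp;|&gt;|https|-{3,}")


def clean_text(text):
    """Remove the masked swearwords, HTML entities, 'https' and long hyphen runs."""
    return CLEAN_PATTERN.sub("", text)


# Runs of three or more hyphens disappear; "--" and "-" are left alone
print(repr(clean_text("a -- b --- c ---- d")))
print(repr(clean_text("[***] this &amp; that &gt; https://x.y")))
```

On a dataframe column, the same pattern string could be passed to `Series.str.replace(..., regex=True)`, which overwrites `full_text` in place without creating a new column.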

Note 1. You can obtain stopwords and punctuation as follows.
- Stopwords: 
```
from nltk.corpus import stopwords
stopwords = stopwords.words('english')
```
- Punctuation:
```
import string
punctuation = list(string.punctuation)
```

Note 2. You do not have to apply an additional ordering when there are several trigrams with the same frequency.
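
For the trigram-counting step, the following sketch shows one possible approach using only the standard library instead of `nltk`: zipping a token list with two shifted copies of itself yields the consecutive triples, and `Counter.most_common` returns the most frequent ones. The function name `top_trigrams` and the toy word list are assumptions made for this example.

```python
from collections import Counter


def top_trigrams(words, n=10):
    """Return the n most frequent trigrams in a list of tokens, with their counts."""
    # zip stops at the shortest input, so this yields every run of 3 consecutive tokens
    trigrams = zip(words, words[1:], words[2:])
    return Counter(trigrams).most_common(n)


words = ["space", "marine", "chapter", "space", "marine", "chapter", "master"]
top = top_trigrams(words, n=2)
print(top[0])  # (('space', 'marine', 'chapter'), 2)
```

Applied to each row of `adf['all_words']`, a function like this would give the list of `(trigram, frequency)` tuples for one subreddit.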

In [None]:
# necessary imports here for extra clarity
from nltk.corpus import stopwords as sw
import string
import warnings

def get_tris(df, stopwords_list, punctuation_list):
   
    # 1 MARK - create new df with only relevant columns
    newdf = df[['subreddit', 'title', 'selftext']].copy()

    # 1 MARK - concatenate title and selftext
    newdf['full_text'] = newdf['title'] + '. ' + newdf['selftext']

    # 1 MARK for string replacement - replace newlines with spaces
    newdf['full_text'] = newdf['full_text'].str.replace('\n', ' ', regex=False)

    # 1 MARK for regex replacement - remove the strings "[***]", "&amp;", "&gt;" and "https", also at least three consecutive dashes
    # note: this pattern is broader than the task requires, as it also drops any other bracketed span, any HTML entity, and "http"
    newdf['full_text'] = newdf['full_text'].str.replace(r'\[.*?\]|&#?\w+;|https?|-{3,}', '', regex=True)

    # 1 MARK - lower case, tokenize, and add result to full_text_tokenized
    newdf['full_text_tokenized'] = newdf['full_text'].str.lower().apply(word_tokenize)

    # 2 MARKS - clean the full_text_tokenized column by iterating over each word and discarding if it's either a stopword or punctuation
    newdf['full_text_tokenized_clean'] = newdf['full_text_tokenized'].apply(lambda x: [word for word in x if word not in stopwords_list and word not in punctuation_list])

    # 2 MARKS - create new aggregated dataframe by concatenating all full_text_tokenized_clean values - rename columns as requested
    adf = newdf.groupby('subreddit')['full_text_tokenized_clean'].agg(lambda x: [item for sublist in x for item in sublist]).reset_index()
    adf.columns = ['subreddit', 'all_words']

    # 3 MARKS - create new Series object by piping nltk's FreqDist and trigrams functions into all_words
    tri_counts = adf['all_words'].apply(lambda x: list(nltk.FreqDist(nltk.trigrams(x)).items()))

    # 3 MARKS - create output dictionary by zipping subreddit column from adf and tri_counts into a list of tuples, then passing dict()
    # the top 10 most frequent ngrams are obtained by calling sorted() on tri_counts and keeping only the top 10 elements
    output_dict = dict(zip(adf['subreddit'], tri_counts.apply(lambda x: sorted(x, key=lambda y: y[1], reverse=True)[:10])))
    
    return output_dict


In [None]:
# get stopwords as list (bound to a new name so the `sw` module alias is not overwritten)
stopwords_list = sw.words('english')
# get punctuation as list
p = list(string.punctuation)
# optional: silence warnings such as pandas' SettingWithCopyWarning
warnings.filterwarnings('ignore')
get_tris(df, stopwords_list, p)



{'40kLore': [(('whose', 'bolter', 'anyway'), 8),
  (('started', 'us', 'examples'), 8),
  (('space', 'marine', 'chapter'), 7),
  (('kabal', 'black', 'heart'), 7),
  (('lo', '’', 'tos'), 7),
  (('die', 'paragon', 'knights'), 6),
  (('dark', 'age', 'technology'), 4),
  (('let', "'s", 'say'), 4),
  (('``', 'star', 'claimers'), 4),
  (('star', 'claimers', "''"), 4)],
 'AMD_Stock': [(('created', 'subreddit', 'reddit'), 10),
  (('subreddit', 'reddit', 'posts'), 10),
  (('open', 'redditors', 'posts'), 10),
  (('redditors', 'posts', 'well'), 10),
  (('posts', 'well', 'please'), 10),
  (('well', 'please', 'consider'), 10),
  (('please', 'consider', 'subscribing'), 10),
  (('consider', 'subscribing', 'find'), 10),
  (('subscribing', 'find', 'posts'), 10),
  (('find', 'posts', 'helpful'), 10)],
 'Anki': [(('``', 'conditional', "''"), 7),
  (('``', 'field2', "''"), 5),
  (('\\^conditional', 'field1', '/conditional'), 5),
  (('conditional', "''", 'filled'), 4),
  (('font-family', 'simplified', 'arab

## P1.2 - Answering questions with pandas (15 marks)

In this question, your task is to use pandas to answer questions about the data.

### P1.2.1 - Authors that post highly commented posts (3 marks)

Find the top 1000 most commented posts. Then, obtain the names of the authors that have at least 3 posts among these posts.

**What to implement:** Implement a function `find_popular_authors(df)` that takes as input the original dataframe and returns a list of strings, where each string is the name of an author that satisfies the above criteria.
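
The logic can be sketched on a toy example without pandas: sort the posts by comment count, keep the top slice, and count how often each author appears in it. The function name `popular_authors`, its parameters and the sample data are assumptions made for this illustration; with the real data the cut-off would be 1000 posts.

```python
from collections import Counter


def popular_authors(posts, top_n=1000, min_posts=3):
    """posts is an iterable of (author, num_comments) pairs."""
    # Keep the top_n posts with the most comments
    top_posts = sorted(posts, key=lambda post: post[1], reverse=True)[:top_n]
    # Count how many of those posts each author wrote
    counts = Counter(author for author, _ in top_posts)
    return [author for author, count in counts.items() if count >= min_posts]


posts = [("ann", 50), ("bob", 40), ("ann", 30), ("ann", 20), ("bob", 10), ("cat", 5)]
print(popular_authors(posts, top_n=4, min_posts=3))  # ['ann']
```

Note that posts tied at the cut-off are kept or dropped according to their original order, since the sort is stable.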

In [None]:
def find_popular_authors(df):
    # Sort the posts by number of comments in descending order
    sorted_df = df.sort_values('num_comments', ascending=False)

    # Take the top 1000 most commented posts
    top_1000_df = sorted_df.head(1000)

    # Count the number of posts per author
    author_counts = top_1000_df['author'].value_counts()

    # Filter the authors that have at least 3 posts among the top 1000
    popular_authors = author_counts[author_counts >= 3].index.tolist()

    return popular_authors

In [None]:
find_popular_authors(df)

['AutoModerator',
 'r[***]og',
 'jigsawmap',
 'Salramm01',
 'HippolasCage',
 'FunPeach0',
 'iSlingShlong',
 'Stoaticor',
 'kevinmrr',
 'ratioetlogicae',
 'None',
 'harushiga',
 'tefunka',
 'SlobBarker',
 'stargem5',
 'AristonD',
 'werdmouf',
 'Cross_Ange',
 'samzz41',
 'itsreallyreallytrue',
 'SUPERGUESSOUS',
 'Frocharocha',
 'habichuelacondulce',
 'CantStopPoppin',
 'Allstarhit',
 'theitguyforever',
 'rebooted_life_42',
 'Zhana-Aul',
 'Not4Reel',
 'Jellyrollrider',
 'NYLaw',
 'MakeItRainSheckels',
 'TurtleFacts72',
 'Defie-LOH-Gic',
 'Typoqueen00',
 'imagepoem',
 'nycsellit4me',
 'madman320',
 'mythrowawaybabies',
 'kogeliz',
 'strngerdngermaus',
 'Kinmuan',
 'AllisonGator',
 'Antiliani',
 'vizard673',
 'notpreposterous',
 'BanDerUh',
 'dukey',
 'BebeFanMasterJ',
 'Fr1sk3r',
 'Gambit08',
 'XDitto',
 'elt0p0',
 'twistedlogicx',
 'TAKEitTOrCIRCLEJERK',
 'Ramy_',
 'tacolben',
 'Morihando',
 '2020c[***]er[***]',
 'dunphish64',
 'apocalypticalley',
 'dsbwayne',
 'schuey_08',
 'blacked_love

### P1.2.2 - Distribution of posts per weekday (5 marks)

Find the percentage of posts that were posted on each weekday (Monday, Tuesday, etc.). You can use an external calendar or any of the date-handling functionality available in pandas.

**What to implement:** A function `get_weekday_post_distribution(df)` that takes as input the original dataframe and returns a dictionary of the form (the values are made up):

```
{'Monday': '14%',
'Tuesday': '23%', 
...
}
```

Note that you must only return two decimals, and you must include the percentage sign in the output dictionary. 

Note that the order of the keys in the returned dictionary does not matter.
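
As a small worked example of the expected formatting, the sketch below computes the distribution for three timestamps using only the standard library. The timestamp format follows the `posted_at` values shown in the dataset preview; the function name `weekday_distribution` and the sample list are assumptions made for this illustration.

```python
from collections import Counter
from datetime import datetime

# datetime.weekday() returns 0 for Monday through 6 for Sunday
WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]


def weekday_distribution(timestamps):
    """Map weekday name to the share of timestamps falling on it, e.g. '14.31%'."""
    days = [WEEKDAY_NAMES[datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").weekday()]
            for ts in timestamps]
    counts = Counter(days)
    total = len(days)
    # Two decimals followed by a percentage sign
    return {day: f"{100 * n / total:.2f}%" for day, n in counts.items()}


sample = ["2020-08-17 20:26:04", "2020-08-18 09:00:00", "2020-08-24 12:30:00"]
print(weekday_distribution(sample))  # {'Monday': '66.67%', 'Tuesday': '33.33%'}
```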

In [None]:
def get_weekday_post_distribution(df):
    # Parse the 'posted_at' column as datetimes without modifying the input dataframe
    posted_at = pd.to_datetime(df['posted_at'])

    # Map each timestamp to its weekday name (Monday, Tuesday, ...)
    weekdays = posted_at.dt.day_name()

    # Calculate the percentage of posts for each weekday
    weekday_percentages = weekdays.value_counts(normalize=True) * 100

    # Create the output dictionary, with two decimals and a percentage sign
    output_dict = {}
    for weekday, percentage in weekday_percentages.items():
        output_dict[weekday] = '{:.2f}%'.format(percentage)

    return output_dict


In [None]:
get_weekday_post_distribution(df)

{'Wednesday': '14.89%',
 'Friday': '14.79%',
 'Thursday': '14.75%',
 'Tuesday': '14.54%',
 'Monday': '14.31%',
 'Saturday': '13.76%',
 'Sunday': '12.96%'}

### P1.2.3 - The 100 most passionate redditors (7 marks)

We would like to know which are the 100 redditors (`author` column) that are most passionate. We will measure this by checking, for each redditor, the ratio at which they use adjectives. This ratio will be computed by dividing number of adjectives by the total number of words each redditor used. The analysis will only consider redditors that have written at least 1000 words.

**What to implement:** A function called `get_passionate_redditors(df)` that takes as input the original dataframe and returns a list of the top 100 redditors (authors) by the ratio at which they use adjectives considering both the `title` and `selftext` columns. The returned list should be a list of tuples, where each inner tuple has two elements: the redditor (author) name, and the ratio of adjectives they used. The returned list should be sorted by adjective ratio in descending order (highest first). Only redditors that wrote more than 1000 words should be considered. You should use `nltk`'s `word_tokenize` and `pos_tag` functions to tokenize and find adjectives. You do not need to do any preprocessing like stopword removal, lemmatization or stemming.
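
The aggregation part of this task can be sketched independently of the tagging step. The example below assumes the posts have already been tokenized and tagged (for instance with `word_tokenize` and `pos_tag`), and treats the Penn Treebank tags `JJ`, `JJR` and `JJS` as adjectives. The function name `adjective_ratios`, its parameters and the toy data are assumptions made for this illustration, and the threshold is read as "at least `min_words` words".

```python
from collections import defaultdict

# Penn Treebank tags for adjectives, comparatives and superlatives
ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}


def adjective_ratios(tagged_posts, min_words=1000, top_n=100):
    """tagged_posts is an iterable of (author, [(token, tag), ...]) pairs."""
    totals = defaultdict(lambda: [0, 0])  # author -> [adjectives, words]
    for author, tagged_tokens in tagged_posts:
        totals[author][0] += sum(1 for _, tag in tagged_tokens if tag in ADJECTIVE_TAGS)
        totals[author][1] += len(tagged_tokens)

    # Keep authors with enough words and compute their adjective ratio
    ratios = [(author, adjectives / words)
              for author, (adjectives, words) in totals.items()
              if words >= min_words]
    ratios.sort(key=lambda pair: pair[1], reverse=True)
    return ratios[:top_n]


tagged_posts = [
    ("ann", [("great", "JJ"), ("news", "NN")]),
    ("bob", [("the", "DT"), ("cat", "NN"), ("sat", "VBD"), ("down", "RP")]),
    ("ann", [("better", "JJR"), ("days", "NNS")]),
]
print(adjective_ratios(tagged_posts, min_words=4))  # [('ann', 0.5), ('bob', 0.0)]
```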

In [None]:
import nltk
from nltk import word_tokenize, pos_tag
from collections import defaultdict

def get_passionate_redditors(df):
    # per-author counters: [number of adjectives, total number of tokens]
    redditor_dict = defaultdict(lambda: [0, 0])
    for index, row in df.iterrows():
        # combine title and selftext columns into one string
        text = row['title'] + ' ' + row['selftext']
        tokens = word_tokenize(text) # tokenize the text
        tags = pos_tag(tokens) # part of speech tagging
        adjectives = [tag[0] for tag in tags if tag[1] in ['JJ', 'JJR', 'JJS']] # extract adjectives
        num_adjectives = len(adjectives)
        num_words = len(tokens)
        author = row['author']
        redditor_dict[author][0] += num_adjectives
        redditor_dict[author][1] += num_words

    passionate_redditors = []
    for author, counts in redditor_dict.items():
        if counts[1] >= 1000: # only consider redditors with at least 1000 words
            adj_ratio = counts[0] / counts[1] # calculate adjective ratio
            passionate_redditors.append((author, adj_ratio))

    # sort once by adjective ratio, highest first
    passionate_redditors.sort(key=lambda x: x[1], reverse=True)
    return passionate_redditors[:100] # return top 100 redditors


In [None]:
import nltk

def get_passionate_redditors(df):
    # Tokenize title and selftext (joined by '. ') and tag the parts of speech
    df['full_text_tokenized'] = df.apply(lambda row: nltk.word_tokenize(row['title'] + '. ' + row['selftext']), axis=1)
    df['pos_tags'] = df.apply(lambda row: nltk.pos_tag(row['full_text_tokenized']), axis=1)

    # Calculate the number of adjectives and total words for each author
    author_counts = {}
    for _, row in df.iterrows():
        author = row['author']
        if author not in author_counts:
            author_counts[author] = {'adjectives': 0, 'total_words': 0}
        for word, pos in row['pos_tags']:
            if pos.startswith('JJ'):
                author_counts[author]['adjectives'] += 1
            author_counts[author]['total_words'] += 1

    # Calculate adjective ratio for each author and filter by word count
    passionate_redditors = []
    for author, counts in author_counts.items():
        if counts['total_words'] >= 1000:
            adj_ratio = counts['adjectives'] / counts['total_words']
            passionate_redditors.append((author, adj_ratio))

    # Sort by adjective ratio in descending order and return the top 100
    passionate_redditors.sort(key=lambda x: x[1], reverse=True)
    return passionate_redditors[:100]


In [None]:
get_passionate_redditors(df)

[('OhanianIsTheBest', 0.14718787395293179),
 ('healrstreettalk', 0.13043478260869565),
 ('FreedomBoners', 0.12429565793834936),
 ('factfind', 0.11272504091653028),
 ('Travis-Cole', 0.10551181102362205),
 ('SecretAgentIceBat', 0.10299145299145299),
 ('fullbloodedwhitemale', 0.10263929618768329),
 ('GeAlltidUpp', 0.09623624782860452),
 ('backpackwayne', 0.09612277867528271),
 ('Tripmooney', 0.09552495697074011),
 ('mission_improbables', 0.09512719455392332),
 ('nyello-2000', 0.09355067328136074),
 ('EMB1981', 0.09316101238556812),
 ('greyuniwave', 0.0912906610703043),
 ('th3allyK4t', 0.09067357512953368),
 ('Venus230', 0.08927108927108927),
 ('35quai', 0.08869565217391304),
 ('kent_k', 0.08856345885634588),
 ('120inn[***]', 0.08775137111517367),
 ('theinfinitelight', 0.08732534930139721),
 ('notinferno', 0.08640776699029126),
 ('rrixham', 0.08589458922519574),
 ('Ninten-Doh', 0.08562197092084006),
 ('kay278', 0.08499005964214712),
 ('secretymology', 0.08494031221303948),
 ('society0', 0.

In [None]:
get_passionate_redditors(df)

[('FreedomBoners', 0.12681912681912683),
 ('EMB1981', 0.09973753280839895),
 ('backpackwayne', 0.09927710843373494),
 ('Travis-Cole', 0.09859154929577464),
 ('SecretAgentIceBat', 0.09746954076850985),
 ('mission_improbables', 0.08791448516579406),
 ('GeAlltidUpp', 0.08745369624738283),
 ('factfind', 0.08734804142278253),
 ('rrixham', 0.08634933123524784),
 ('yellowsnow2', 0.0854119425547997),
 ('greyuniwave', 0.0852627710400588),
 ('The_In-Betweener', 0.07946278679350867),
 ('Stoaticor', 0.07934238741958542),
 ('120inn[***]', 0.07908163265306123),
 ('clemaneuverers', 0.07362377575143532),
 ('dontbuyanylogos', 0.07287093942054433),
 ('SlobBarker', 0.07096774193548387),
 ('reddit_loves_pedos', 0.07011915673693858),
 ('CuteBananaMuffin', 0.0700209927541139),
 ('CommonEmployment2', 0.06983511154219205),
 ('Benster_ninja', 0.06855357471053115),
 ('BlindingTwilight', 0.0669710806697108),
 ('AutoModerator', 0.066797161732322),
 ('Kinmuan', 0.06666666666666667),
 ('Pretty_iin_Pink', 0.06634544

## P1.3 Ethics (10 marks)

Imagine you are **the head of a data mining company** that needs to use the insights gained in this assignment to scan social media for covid-related content, and automatically flag it as conspiracy or not conspiracy (for example, for hiding potentially harmful tweets or Facebook posts). Some information about the project and the team:

- Your client is a political party concerned about misinformation.
- The project requires mining Facebook, Reddit and Instagram data.
- The team consists of Joe, an American mathematician who just finished college; Fei, a senior software engineer from China; and Francisco, a data scientist from Spain.

Reflect on the impact of exploiting data science for such an application. You should map your discussion to one of the five actions outlined in the UK’s Data Ethics Framework.

Your answer should address the following:
- Identify the action in which your project is the weakest.
- Then, justify your choice by critically analyzing the three key principles for that action outlined in the Framework, namely transparency, accountability and fairness.
- Finally, you should propose one solution that explicitly addresses one point related to one of these three principles, reflecting on how your solution would improve the data cycle in this particular use case.

Your answer should be between 500 and 700 words. **You are strongly encouraged to follow a scholarly approach, e.g., with references to peer reviewed publications. References do not count towards the word limit**.



---


The principle of transparency requires that data mining processes and their outcomes be made visible, understandable, and accessible to stakeholders, including users, customers, regulators, and other relevant parties. It is particularly important in the context of COVID-related content as the potential for misinformation and manipulation of public opinion is high.

Within this project, transparency can be broken down into three requirements:

- **Accessibility**: the data mining process and its outcomes must be easily accessible to stakeholders and users, such as regulators, customers, and other relevant parties. This ensures that users can understand the process and outcomes, and that stakeholders can make informed decisions.
- **Understandability**: the data mining process and its outcomes must be presented in a clear and concise manner that is easy to understand. This ensures that users can interpret the outcomes and that stakeholders can assess the impact of the process on the wider community.
- **Clarity**: the data mining process and its outcomes must be presented openly, with no hidden or undisclosed elements. This ensures that users and stakeholders can make informed decisions and judgments.

One possible solution to address the lack of transparency is to develop an online dashboard that presents the data mining process and its outcomes in a clear and accessible manner. The dashboard would provide an overview of the process, including the data sources, algorithms used, and how the data is analyzed. It would also provide an overview of the outcomes, including the number of flagged posts, and the percentage of posts that were flagged as conspiracy theories. This would help to make the process transparent and easily understandable to users and stakeholders.

Moreover, it is important to ensure that the system is fair and unbiased, meaning that it is not designed to specifically target a particular group or ideology. This can be achieved by conducting regular audits and reviews of the system to ensure that it is operating in a fair and unbiased manner. Additionally, the team should ensure that the data used in the system is representative of a wide range of sources and viewpoints.

In conclusion, while data science can be a powerful tool in identifying and flagging misinformation in social media, it is crucial to ensure that the project adheres to ethical considerations, such as transparency, fairness, and accountability. The development of an online dashboard and conducting regular audits and reviews can help to address the lack of transparency and ensure the system is operating fairly and without bias.





