# Introduction: *Identification of Russian Trolls on Twitter*

< [**Previous:** 00 - Project Guidelines](00-ProjectGuidelines.ipynb) | [**Next**: 02 - Import](02-Import.ipynb) >

Written by Sarthak Khillon.

Created for Professor Brian Granger's *DATA 301: Introduction to Data Science* course at Cal Poly SLO. 

## Index

- [00 - Project Guidelines](00-ProjectGuidelines.ipynb)
- 01 - Introduction (you are here)
- [02 - Import](02-Import.ipynb)
- [03 - Tidy](03-Tidy.ipynb)
- [04 - EDA](04-EDA.ipynb)
- [05 - Modeling](05-Modeling.ipynb)
- [06 - Presentation](06-Presentation.ipynb)

## Abstract

Recently, evidence has been mounting that Russia influenced the 2016 U.S. presidential election, primarily by creating social media buzz that polarized voters against the other side. Social media feeds further reinforce this echo chamber by serving content that matches what users already like (see [Blue Feed, Red Feed](http://graphics.wsj.com/blue-feed-red-feed/) from the Wall Street Journal). Social media giants have begun flagging content from questionable sources in an effort to combat this effect. In December 2016, Facebook [began flagging fake news](https://newsroom.fb.com/news/2016/12/news-feed-fyi-addressing-hoaxes-and-fake-news/) but [stopped in December 2017](https://medium.com/facebook-design/designing-against-misinformation-e5846b3aa1e2) for a variety of reasons, one being that flagging [may backfire and further reinforce biased opinions](http://journals.sagepub.com/doi/pdf/10.1177/1529100612451018).

However, content shared on Twitter is different: it is less a series of links and more a series of user opinions. An exclusive article from NBC News released a data set of over 200,000 tweets from over 400 accounts marked as Russian trolls. **The aim of this project is to create an algorithm that identifies accounts as Russian trolls.**

## Data Sets

There are four raw data sets and two main data sets, all of which are located in `/data/skhillon/`. The two main data sets are `tweets.csv` and `users.csv`, each of which is composed of its `pol` (politician) and `troll` sub-parts.

See the [Twitter Developer Documentation](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object) for a complete reference of the fields. Note that the documentation may have been updated since the creation of this dataset.

**tweets.csv**: Contains tweets by users flagged as Russian trolls.
    - user_id (int): Twitter User ID.
    - user_key (str): User's "handle".
    - created_at (int): Tweet creation time (assumed to be the number of seconds since the UNIX epoch).
    - retweet_count (int): Number of users who retweeted this tweet.
    - retweeted (str): A boolean status indicating whether this tweet was retweeted. *On visual scan, values are either empty or "FALSE"*
    - favorite_count (int): Number of users who favorited this tweet.
    - text (str): The tweet itself.
    - tweet_id (int): A numeric identifier for this specific tweet.
    - source (str): An HTML link identifying the client used to post the tweet.
    - hashtags ([str]): An array of hashtags found in the tweet.
    - expanded_urls ([str]): Twitter encodes URLs into its own "t.co"-type links. This array contains a list of the original URLs found in the tweet.
    - posted (str): Not sure; seems to indicate whether or not the tweet was posted. *On visual scan, all values here are the same: "POSTED"*
    - mentions ([str]): Array of other Twitter users mentioned in the tweet.
    - retweeted_status_id (int): If this tweet is a retweet, this field indicates the id of the original tweet.
    - in_reply_to_status_id (int): If this tweet is replying to another tweet, this field indicates the id of the original tweet.
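
The field types listed above suggest a small cleaning step when reading the file. The following is a minimal sketch, not the project's actual import code. It uses only the standard library and a made-up two-row sample in place of the real file. The helper names (`parse_epoch`, `parse_bool`, `parse_list`, `read_tweets`) are hypothetical, the array columns are assumed to be stored as JSON-style strings, and the seconds-versus-milliseconds check is a heuristic that should be verified against the real data.

```python
import csv
import io
import json
from datetime import datetime, timezone

# Invented sample that mimics some of the columns described above.
# These rows are NOT taken from the real tweets.csv.
SAMPLE_TWEETS_CSV = '''user_id,user_key,created_at,retweet_count,retweeted,favorite_count,text,tweet_id,hashtags
123,example_user,1458672000,2,FALSE,5,Hello #world,1001,"[""world""]"
456,other_user,1458672000000,,,,No tags here,1002,[]
'''


def parse_epoch(raw):
    """Convert an epoch string to a UTC datetime; empty cells become None."""
    if not raw:
        return None
    value = float(raw)
    # Heuristic (assumption): values this large are milliseconds, not seconds.
    if value > 1e11:
        value /= 1000.0
    return datetime.fromtimestamp(value, tz=timezone.utc)


def parse_int(raw):
    """Convert a numeric string to int; empty cells become None."""
    return int(raw) if raw else None


def parse_bool(raw):
    """Map "TRUE"/"FALSE" strings to booleans; anything else becomes None."""
    return {"TRUE": True, "FALSE": False}.get(raw.strip().upper())


def parse_list(raw):
    """Parse a JSON-style array cell (assumed format) into a Python list."""
    return json.loads(raw) if raw else []


def read_tweets(handle):
    """Yield one cleaned dict per row of a tweets.csv-style file."""
    for row in csv.DictReader(handle):
        yield {
            "user_id": parse_int(row["user_id"]),
            "user_key": row["user_key"],
            "created_at": parse_epoch(row["created_at"]),
            "retweet_count": parse_int(row["retweet_count"]),
            "retweeted": parse_bool(row["retweeted"]),
            "favorite_count": parse_int(row["favorite_count"]),
            "text": row["text"],
            "tweet_id": parse_int(row["tweet_id"]),
            "hashtags": parse_list(row["hashtags"]),
        }


rows = list(read_tweets(io.StringIO(SAMPLE_TWEETS_CSV)))
for row in rows:
    print(row["user_key"], row["created_at"], row["hashtags"])
```

To try this on the real data, pass an open file handle (for example one opened with `newline=""` and `encoding="utf-8"`) instead of the `io.StringIO` sample. Empty cells are kept as `None` rather than `0` so that missing values stay distinguishable from true zeros.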

**users.csv**: Contains information on each user flagged as a Russian troll.
    - id (int): Twitter User ID.
    - location (str): User's location.
    - followers_count (int): Number of users who follow this user.
    - statuses_count (int): Number of tweets posted.
    - time_zone (str): User's Time Zone.
    - verified (str): A boolean status, TRUE or FALSE, indicating whether or not the user's account is verified by Twitter. *On visual scan, all values seem to be "FALSE"*
    - lang (str): User's profile language setting.
    - screen_name (str): User's "handle".
    - description (str): How the user describes themselves; a "bio".
    - created_at (str): Account creation date.
    - favourites_count (int): Number of tweets this user has favorited.
    - friends_count (int): Number of users this user follows (Twitter calls these "friends").
    - listed_count (int): Number of times this user appears on another user's list (e.g., "top ten celebrities").
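
Because both files carry a Twitter User ID, tweets can presumably be matched to their authors. The sketch below illustrates that idea with an invented sample. It assumes that `user_id` in `tweets.csv` corresponds to `id` in `users.csv`, which the field descriptions suggest but which should be checked against the data. The function names `read_users` and `attach_screen_names` are hypothetical.

```python
import csv
import io

# Invented sample that mimics some of the users.csv columns.
# These rows are NOT taken from the real file.
SAMPLE_USERS_CSV = '''id,location,followers_count,statuses_count,verified,screen_name
123,USA,1500,300,FALSE,example_user
456,,20,,FALSE,other_user
'''

INT_FIELDS = ("id", "followers_count", "statuses_count")


def read_users(handle):
    """Return a dict mapping user id -> cleaned user record."""
    users = {}
    for row in csv.DictReader(handle):
        record = dict(row)
        for field in INT_FIELDS:
            # Empty cells become None instead of raising on int("").
            record[field] = int(row[field]) if row[field] else None
        # "TRUE"/"FALSE" strings become booleans; anything else becomes None.
        flag = row["verified"].strip().upper()
        record["verified"] = {"TRUE": True, "FALSE": False}.get(flag)
        users[record["id"]] = record
    return users


def attach_screen_names(tweet_user_ids, users):
    """Pair each tweet's user_id with the matching screen_name, if known."""
    return [
        (uid, users[uid]["screen_name"] if uid in users else None)
        for uid in tweet_user_ids
    ]


users = read_users(io.StringIO(SAMPLE_USERS_CSV))
# 999 is a user id with no profile row, so its screen name is None.
print(attach_screen_names([123, 456, 999], users))
```

Keying the records by `id` makes each lookup a constant-time dictionary access, and unknown ids map to `None` so that tweets without a matching profile are not silently dropped.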

## Citations

- I first found the troll datasets mentioned in a Reddit post from [/r/datasets](https://www.reddit.com/r/datasets/) titled [*"200K tweets from Russian trolls manipulating 2016 election; deleted by twitter, unavailable elsewhere"*](https://www.google.com). The post was created by [/u/everywhere_anyhow](https://www.reddit.com/user/everywhere_anyhow) on Wednesday, February 14, 2018 at 12:31:25 UTC.
- The Reddit post links to an article from NBC News titled [*"Twitter deleted 200,000 Russian troll tweets. Read them here."*](https://www.nbcnews.com/tech/social-media/now-available-more-200-000-deleted-russian-troll-tweets-n844731) The article, which contains links to the raw data files, was published by Ben Popken on February 14, 2018 at 1:55 AM EST. The data was obtained by running `wget` in `/data/skhillon/` with links to the 2 datasets.
- The politician (non-troll) tweets were found in a [data.world](https://data.world/bkey/politician-tweets) post and were also obtained by running `wget` in `/data/skhillon/`. The dataset was uploaded on April 7, 2017.

< [**Previous:** 00 - Project Guidelines](00-ProjectGuidelines.ipynb) | [**Next**: 02 - Import](02-Import.ipynb) >