<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Presidential-Election-Data" data-toc-modified-id="Presidential-Election-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Presidential Election Data</a></span><ul class="toc-item"><li><span><a href="#Online-dataset" data-toc-modified-id="Online-dataset-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Online dataset</a></span></li><li><span><a href="#Web-Scraping" data-toc-modified-id="Web-Scraping-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Web Scraping</a></span></li></ul></li><li><span><a href="#Movie-list" data-toc-modified-id="Movie-list-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Movie list</a></span><ul class="toc-item"><li><span><a href="#Web-Scraping" data-toc-modified-id="Web-Scraping-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Web Scraping</a></span></li><li><span><a href="#Online-dataset" data-toc-modified-id="Online-dataset-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Online dataset</a></span></li><li><span><a href="#Online-repository" data-toc-modified-id="Online-repository-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Online repository</a></span></li><li><span><a href="#Final-movie-list" data-toc-modified-id="Final-movie-list-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Final movie list</a></span></li></ul></li><li><span><a href="#Average-Movie-ratings" data-toc-modified-id="Average-Movie-ratings-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Average Movie ratings</a></span></li><li><span><a href="#Tweets" data-toc-modified-id="Tweets-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Tweets</a></span></li></ul></div>

<div style="text-align:center; font-size:25px; color:#A12A0B"><strong>Popularity of LGBT/Feminist movies by state in US</strong></div> <br>
<div style='text-align:center; font-size:20px'><b>Yuejun Wu</b></div>
<div style='text-align:center; font-size:20px'>Open Data Mashups</div>
<div style="text-align:center; font-size:15px"><em>FALL2018</em></div>

In [108]:
# import all necessary packages
import tweepy
import pandas as pd
import requests as req
from lxml import etree
from bs4 import BeautifulSoup
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from tqdm import tqdm_notebook
import csv
import time
from textblob import TextBlob

## Presidential Election Data
Six presidential elections took place from 1996 to 2016 (one every 4 years). Results for 2000 to 2016 are available online in standard file formats, while the 1996 results have to be scraped from a web page.

- **Data set**: Presidential election data from 1996 to 2016. <br>
- **Goal**: Identify conservative, liberal and swing states. For example, if a state voted for the Republican Party more than 3 times between 1996 and 2016, it is labeled "Conservative." Since the data cover 6 elections, a state that voted for each party exactly 3 times is labeled "Swing."<br>
- **Output**: 2 columns and 52 rows representing states and their corresponding political tendencies. <br><br>
- **Data Source:** 
    1. Existing csv/xls/xlw files online
    2. Web Scraping 1996 election data online
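
The labeling rule stated in the goal can be sketched in plain Python. This is a minimal illustration, assuming each state's results have already been reduced to a count of Republican wins and that every election was won by one of the two major parties. The function name `label_state` and the sample counts are hypothetical, not real election results.

```python
def label_state(republican_wins, total_elections=6):
    """Label a state from the number of elections won by the Republican candidate.

    Assumes every election was won by one of the two major parties.
    """
    threshold = total_elections // 2  # 3 when there are 6 elections
    democratic_wins = total_elections - republican_wins
    if republican_wins > threshold:
        return "Conservative"
    if democratic_wins > threshold:
        return "Liberal"
    return "Swing"


# Hypothetical win counts, for illustration only
sample_counts = {"State A": 6, "State B": 3, "State C": 1}
labels = {state: label_state(wins) for state, wins in sample_counts.items()}
print(labels)
```

With 6 elections, only an exact 3-3 split yields "Swing"; any other count is a majority for one party.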

### Online dataset

**Current status**: The standard-format files have been downloaded. The next step is to read them into dataframes and identify each state's voting result. <br><br>
Online source: https://transition.fec.gov/pubrec/electionresults.shtml <br>
It contains data for 2000, 2004, 2008, 2012 and 2016.

### Web Scraping
**Current status**: Web scraping of the 1996 voting results is done, as shown below. The next step is to combine the results from 1996 to 2016 to determine each state's political tendency. <br><br>

Web scraping source: https://transition.fec.gov/pubrec/fe1996/elecpop.htm <br>
It contains data from 1996.

In [2]:
# Parse htm file into text content with exception handler
def simple_get(url):
    """
    Attempts to get the content at 'url' by making an HTTP GET request.
    If the content-type of response is some kind of HTML/XML, return the
    text content, otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                # r.text is the content of the response in unicode, 
                # and r.content is the content of the response in bytes.
                return resp.content
            else:
                return None
            
    except RequestException as e:
        log_error('Error during requests to {0}:{1}'.format(url, str(e)))
        return None

In [3]:
# Identify whether the source is in HTML/HTM format or not
def is_good_response(resp):
    """
    Return True if the response seems to be HTML/HTM, False otherwise.
    """
    # .get() returns None instead of raising KeyError when the header is missing
    content_type = resp.headers.get('Content-Type')
    return (resp.status_code == 200 
            and content_type is not None
            and content_type.find('html') > -1)

In [4]:
# Print error message
def log_error(e):
    """
    Print log errors.
    """
    print(e)

In [5]:
# Load 1996 presidential ELECTORAL AND POPULAR VOTE 
url = 'https://transition.fec.gov/pubrec/fe1996/elecpop.htm'
response = simple_get(url)
if response is not None:
    htm = BeautifulSoup(response, 'html.parser')
    # cast to string
    para = str(htm.find_all('pre'))
    temp_content = para[para.find('>AL'):]
    table_content = temp_content[1:temp_content.find('<st')]

In [6]:
# Extract content of election result into a list. Elements in the list represent rows in the raw data 
table_content_li = table_content.split('\r\n')

In [7]:
# Convert to nested list for better processing
# Fill empty space with 'n' indicating 'not voted' for candidates from a specific state
content = []
for row in table_content_li[:-1]:
    a = row.split('        ')
    if a[1] == '':
        a[1] = 'n'
    if a[2] == '':
        a[2] = 'n'
    content.append(a)

In [8]:
# Find exception: one list has different length from the rest due to preprocessing
content[8]

['DC', '3', 'n', '     158,220       17,339', '3,611', '      185,726 ']

In [9]:
# Exception handling
update = content[8][3] + ' ' + content[8][4]
content[8][3] = update

In [10]:
# Check whether the new list meets the requirement
temp_row = content[8]
del temp_row[4]
temp_row

['DC', '3', 'n', '     158,220       17,339 3,611', '      185,726 ']

In [11]:
# replace exception with updated list
content[8] = temp_row

In [12]:
# convert the nested list to a dataframe
df96 = pd.DataFrame(content)

In [13]:
# Add column names for dataframe
df96.columns = ['State', 'Clinton','Dole','Popular vote','Total Popular vote']

In [14]:
df96.head(5)

Unnamed: 0,State,Clinton,Dole,Popular vote,Total Popular vote
0,AL,n,9,"662,165 769,044 92,149",1534349
1,AK,n,3,"80,380 122,746 26,333",241620
2,AZ,8,n,"653,288 622,073 112,072",1404405
3,AR,6,n,"475,171 325,416 69,884",884262
4,CA,54,n,"5,119,835 3,828,380 697,847",10019484
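
The `Popular vote` column still holds several comma-grouped numbers in one string. A minimal sketch of splitting such a cell into integers follows; the helper name `parse_vote_counts` is hypothetical, and the sample strings are copied from the rows shown above. The output above does not say which candidate each number belongs to, so the sketch does not assign them.

```python
def parse_vote_counts(cell):
    """Split a whitespace-separated string of comma-grouped numbers into integers."""
    return [int(token.replace(",", "")) for token in cell.split()]


# Strings taken from the AL row and the repaired DC row above
counts = parse_vote_counts("662,165 769,044 92,149")
dc_counts = parse_vote_counts("     158,220       17,339 3,611")
print(counts, dc_counts)
```

Because `str.split()` with no argument collapses runs of whitespace, the irregular spacing that caused the DC exception does not matter here.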


In [15]:
# Write out election data of 1996
# df96.to_csv('election96.csv')

## Movie list
Three sources, listed below, are combined to build a comprehensive list of movies with LGBT and feminist themes. Each source is handled differently depending on its format. Because the sources are independent of each other, their movie lists overlap. A final list of movie names is generated from them.

**Data Source:**<br>
1. Web scraping from webpages:
    * https://en.wikipedia.org/wiki/List_of_LGBT-related_films
    * https://en.wikipedia.org/wiki/Category:Feminist_films
2. Online dataset:
    * https://www.kaggle.com/juzershakir/tmdb-moviesdataset/home
3. Online repository
    * http://files.grouplens.org/datasets/movielens/ml-20m-README.html

### Web Scraping

Wikipedia has lists of LGBT- and feminism-themed movies. Because LGBT and feminism are topics rather than genres such as Action, Adventure, Drama or Musical, the movie names on Wikipedia might not be sufficient. That is why other online datasets are processed as well.

In [16]:
# LGBT related movie
origin_page = req.get("https://en.wikipedia.org/wiki/List_of_LGBT-related_films")

soup = BeautifulSoup(origin_page.text, "html.parser")

movie_name1 = ''
for element in soup.find_all('a'):
    if element.get('title') is not None:
        movie_name1 += (str(element.string) + "***")

# Get movie names part only
chunks = movie_name1.split('edit***')
for chunk in chunks:
    if chunk.startswith('Z'):
        z_index = chunks.index(chunk)
    if chunk.startswith('$'):
        a_index = chunks.index(chunk)

movie_list1 = chunks[a_index : z_index+1]

# convert each movie into an element of a list
movie_names1 = []
for movie_chunk in movie_list1:
    movie_temp = movie_chunk.split('***')
    movie = movie_temp[:-1]
    movie_names1.extend(movie)
    
# a list of all lgbt movie names from Wiki page
print(len(movie_names1))

2563


In [23]:
# write out lgbt movie list to csv file
# lgbt = pd.DataFrame(movie_names1, columns=["lgbt_movie"])
# lgbt.to_csv('lgbt.csv', index=False)

In [26]:
# Feminism related movies
origin_page = req.get("https://en.wikipedia.org/wiki/Category:Feminist_films")

soup = BeautifulSoup(origin_page.text, "html.parser")

movie_name2 = ''
for element in soup.find_all('a'):
    if element.get('title') is not None:
        movie_name2 += (str(element.string) + "***")

# Get movie names part only
chunks = movie_name2.split('***')
for chunk in chunks:
#     print(chunk)
    if chunk.startswith('Nor'):
        z_index = chunks.index(chunk)
    if chunk.startswith('5'):
        a_index = chunks.index(chunk)

movie_list2 = chunks[a_index : z_index+1]

# convert each movie into an element of a list
movie_names2 = []
for movie_chunk in movie_list2:
    if movie_chunk.endswith("film)"):
        movie_names2.append(movie_chunk[0:movie_chunk.index("(")-1])
    else:
        movie_names2.append(movie_chunk)
    
# a list of all feminist movie names from Wiki page
print(len(movie_names2))

200
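
The cell above drops a trailing "(... film)" disambiguation by cutting at the first `(`. A regular expression anchored at the end of the title is one alternative; the sketch below is an assumption about how that could look, and the sample titles are made up for illustration rather than taken from the Wikipedia category.

```python
import re

# Matches a final parenthesised group ending in "film", e.g. " (film)" or " (2010 film)"
FILM_SUFFIX = re.compile(r"\s*\([^()]*film\)$")


def strip_film_suffix(title):
    """Remove a trailing '(... film)' disambiguation, leaving other parentheses alone."""
    return FILM_SUFFIX.sub("", title)


# Hypothetical titles, for illustration only
for sample in ["Example Title (film)", "Another Title (2010 film)",
               "Plain Title", "A (B) C (film)"]:
    print(repr(strip_film_suffix(sample)))
```

Unlike slicing at the first `(`, this keeps parentheses that are part of the title itself.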


In [27]:
# write out feminism movie list to csv file
# feminism = pd.DataFrame(movie_names2)
# feminism.to_csv('feminism.csv', index=False)

### Online dataset
This source is a comprehensive CSV file from kaggle.com with movie names, keywords, budgets, revenue, etc. The keywords indicate the theme and category of each film and are used to find LGBT/feminist movie names.
<br><br>
**Url**: https://www.kaggle.com/juzershakir/tmdb-moviesdataset/home

In [28]:
file = pd.read_csv('/Users/amberwu/Downloads/UIUC/Course FL 2018/Open Data Mashups/Data repo/Movie avg ratings/Movie_rating2/tmdb_movies_data.csv')

# keywords for search
searchfor = ['lgbt','gay','lesb','strong woman','strong women','femin','homo']

# in order to search multiple keywords at a time
file_alter = file[file['keywords'].str.contains('|'.join(searchfor), na=False)]

In [29]:
# Get 196 rows of data: 196 movie names
file_alter.shape

(196, 21)
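
The keyword filter works by joining the search terms into one regular expression and testing each keyword string for a substring match. The plain-Python sketch below mirrors that behaviour outside pandas; `matches_theme` is a hypothetical helper and the sample keyword strings are invented for illustration.

```python
import re

searchfor = ['lgbt', 'gay', 'lesb', 'strong woman', 'strong women', 'femin', 'homo']
# re.escape guards against search terms that contain regex metacharacters
pattern = re.compile('|'.join(re.escape(term) for term in searchfor))


def matches_theme(keywords):
    """Return True if any search term occurs in the keyword string.

    Non-string values (such as missing keywords) count as no match,
    which mirrors na=False in the pandas call.
    """
    return isinstance(keywords, str) and pattern.search(keywords) is not None


# Hypothetical keyword strings, for illustration only
print(matches_theme("lesbian relationship|1950s"))  # substring 'lesb' matches
print(matches_theme("space|alien"))
print(matches_theme("Feminism"))                    # matching is case-sensitive
```

Two properties are worth keeping in mind: the match is on substrings, so `homo` also matches a keyword such as "homophobia", and it is case-sensitive, so capitalised keywords are missed unless the text is lowercased first.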

In [30]:
name_tmdb = file_alter[['original_title']]

In [31]:
name_tmdb.head(5)

Unnamed: 0,original_title
19,The Hunger Games: Mockingjay - Part 2
50,Carol
157,Ricki and the Flash
168,Suffragette
200,Freeheld


In [33]:
# write out movie list file
# name_tmdb.to_csv("movieName3.csv")

### Online repository

This dataset describes 5-star rating and free-text tagging activity from [MovieLens](http://movielens.org), a movie recommendation service. It contains 20000263 ratings and 465564 tag applications across 27278 movies. These data were created by 138493 users between January 09, 1995 and March 31, 2015. This dataset was generated on March 31, 2015, and updated on October 17, 2016. <br>

The data are contained in six files, `genome-scores.csv`, `genome-tags.csv`, `links.csv`, `movies.csv`, `ratings.csv` and `tags.csv`.

> F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=<http://dx.doi.org/10.1145/2827872>

Column 'tag' in `tags.csv` is used to filter LGBT/Feminist topic movies.

In [141]:
# Read in data
file_rating = pd.read_csv('/Users/amberwu/Downloads/UIUC/Course FL 2018/Open Data Mashups/Data repo/Movie avg ratings/Movie_rating1/ml-20m/ratings.csv')
file_tag = pd.read_csv('/Users/amberwu/Downloads/UIUC/Course FL 2018/Open Data Mashups/Data repo/Movie avg ratings/Movie_rating1/ml-20m/tags.csv')

In [142]:
file_rating.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,1112486027
1,1,29,3.5,1112484676
2,1,32,3.5,1112484819
3,1,47,3.5,1112484727
4,1,50,3.5,1112484580


In [143]:
file_tag.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,18,4141,Mark Waters,1240597180
1,65,208,dark hero,1368150078
2,65,353,dark hero,1368150079
3,65,521,noir thriller,1368149983
4,65,592,dark hero,1368150078


In [144]:
# Read in data
file_movie = pd.read_csv('/Users/amberwu/Downloads/UIUC/Course FL 2018/Open Data Mashups/Data repo/Movie avg ratings/Movie_rating1/ml-20m/movies.csv')

In [145]:
file_movie.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [146]:
# Read in data
file_link=pd.read_csv('/Users/amberwu/Downloads/UIUC/Course FL 2018/Open Data Mashups/Data repo/Movie avg ratings/Movie_rating1/ml-20m/links.csv')
file_link.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [147]:
# search keywords from "tag" for lgbt/feminist related movies
name_tag = file_tag.merge(file_movie, on='movieId')
name_tag_alter = name_tag[name_tag['tag'].str.contains('|'.join(searchfor), na=False)]

In [148]:
# Extract distinct movie names
name_repo = name_tag_alter[['title']]
name_repo_unique = list(name_repo.title.unique())

In [149]:
name_repo_unique[0:5]

['L.A. Confidential (1997)',
 'Frozen (2013)',
 'American Pie (1999)',
 '40-Year-Old Virgin, The (2005)',
 'Knocked Up (2007)']

### Final movie list

Take the union of the three lists of movie names to get the final list of LGBT/feminist movies.

In [161]:
# Movie list from wikipedia.com
name_wiki = []
name_wiki.extend(movie_names1)
name_wiki.extend(movie_names2)

In [31]:
# Movie list from data file at kaggle.com 
name_tmdb_list = name_tmdb['original_title'].tolist()

In [150]:
# Movie list from online repository
name_repo_unique[0:5]

['L.A. Confidential (1997)',
 'Frozen (2013)',
 'American Pie (1999)',
 '40-Year-Old Virgin, The (2005)',
 'Knocked Up (2007)']

In [163]:
movie_names3 = [x[:-7] for x in name_repo_unique]
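
Slicing off the last 7 characters assumes every title ends in a " (YYYY)" suffix, and it leaves MovieLens-style titles such as "40-Year-Old Virgin, The" with the article at the end, where they will not match the same movie in the other lists. The sketch below shows one possible normalization; `normalize_title` is a hypothetical helper and not part of the pipeline above.

```python
import re

YEAR_SUFFIX = re.compile(r"\s*\(\d{4}\)\s*$")
TRAILING_ARTICLE = re.compile(r"^(.*), (The|A|An)$")


def normalize_title(title):
    """Drop a trailing '(YYYY)' and move a trailing ', The/A/An' to the front."""
    title = YEAR_SUFFIX.sub("", title)
    match = TRAILING_ARTICLE.match(title)
    if match:
        title = f"{match.group(2)} {match.group(1)}"
    return title


# Titles taken from the outputs shown in this notebook
for sample in ["Frozen (2013)", "40-Year-Old Virgin, The (2005)", "Birdcage, The (1996)"]:
    print(normalize_title(sample))
```

A title without a year suffix passes through unchanged, whereas `x[:-7]` would cut off its last 7 characters.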

In [164]:
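name_wiki.extend(movie_names3)
# Also include the Kaggle/TMDB titles so that all three sources enter the union
name_wiki.extend(name_tmdb_list)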

In [166]:
# Get unique movie names
final_movie_list = set(name_wiki)
final_movie_list = list(final_movie_list)

In [172]:
final_movie_list[0:10]

['Fake ID',
 'Four Windows',
 'Kissing Jessica Stein',
 'Boogie Woogie',
 'Memento Mori',
 'Gohatto',
 "Alice Doesn't Live Here Anymore",
 'La Luciérnaga',
 'The Lady Assassin',
 'The Man in Her Life']

## Average Movie ratings

Average movie ratings are obtained from the online repository used [here](#Online-repository) to get movie names. Once the final movie list is ready, average ratings will be extracted specifically for those movies. The example below uses `groupby` and `mean()` to compute the average rating, since each movie has multiple ratings from different users. <br><br>
**Data Source:** <br>
- Online repository:
    http://files.grouplens.org/datasets/movielens/ml-20m-README.html
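
As a minimal sketch of what the group-by-and-mean step computes, the same aggregation can be written with the standard library alone. The `(movieId, rating)` pairs below are hypothetical and chosen only to make the arithmetic easy to check.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (movieId, rating) pairs, for illustration only
ratings = [(1, 4.0), (1, 3.0), (2, 5.0), (2, 4.5), (2, 2.5)]

# Group the individual ratings by movie
ratings_by_movie = defaultdict(list)
for movie_id, rating in ratings:
    ratings_by_movie[movie_id].append(rating)

# Average each group, which is what groupby('movieId')['rating'].mean() does
average_rating = {movie_id: mean(values) for movie_id, values in ratings_by_movie.items()}
print(average_rating)
```

The pandas version used in the cells below produces the same per-movie averages in one call and scales better to the 20 million ratings in the repository.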

In [206]:
# Merge on userId and movieId: this keeps only the ratings given by users who also applied a matching tag to that movie
movie_rating = name_tag_alter.merge(file_rating, on=['userId','movieId'])

In [207]:
movie_rating.head(10)

Unnamed: 0,userId,movieId,tag,timestamp_x,title,genres,rating,timestamp_y
0,63618,1617,homoerotic subtext,1368242588,L.A. Confidential (1997),Crime|Film-Noir|Mystery|Thriller,4.5,1351591811
1,96,106696,feminist,1396097502,Frozen (2013),Adventure|Animation|Comedy|Fantasy|Musical|Rom...,3.5,1396097287
2,12131,106696,feminist,1419021976,Frozen (2013),Adventure|Animation|Comedy|Fantasy|Musical|Rom...,4.0,1419021959
3,24134,106696,feminist,1390933092,Frozen (2013),Adventure|Animation|Comedy|Fantasy|Musical|Rom...,5.0,1390933046
4,56879,106696,feminist,1417918347,Frozen (2013),Adventure|Animation|Comedy|Fantasy|Musical|Rom...,4.5,1421637283
5,57434,106696,feminist,1388906221,Frozen (2013),Adventure|Animation|Comedy|Fantasy|Musical|Rom...,4.0,1388906210
6,79167,106696,feminist,1390787321,Frozen (2013),Adventure|Animation|Comedy|Fantasy|Musical|Rom...,3.5,1390787309
7,84441,106696,feminist,1416016304,Frozen (2013),Adventure|Animation|Comedy|Fantasy|Musical|Rom...,2.5,1416015906
8,86761,106696,feminist,1420412577,Frozen (2013),Adventure|Animation|Comedy|Fantasy|Musical|Rom...,4.5,1420412530
9,102576,106696,feminist,1419916572,Frozen (2013),Adventure|Animation|Comedy|Fantasy|Musical|Rom...,4.5,1419916563


In [208]:
# Calculate average movie ratings
movieid_rating = movie_rating.groupby('movieId')[['rating']].mean()
movieid_rating=movieid_rating.reset_index()

In [209]:
# The above dataframe only has movieid not movie name. 
# Merge files to get ratings for movie titles
moviename_rating = movieid_rating.merge(file_movie, on='movieId')
moviename_rating.head(10)

Unnamed: 0,movieId,rating,title,genres
0,35,4.5,Carrington (1995),Drama|Romance
1,49,4.5,When Night Is Falling (1995),Drama|Romance
2,82,3.583333,Antonia's Line (Antonia) (1995),Comedy|Drama
3,141,3.576923,"Birdcage, The (1996)",Comedy
4,171,4.125,Jeffrey (1995),Comedy|Drama
5,178,3.5,Love & Human Remains (1993),Comedy|Drama
6,198,2.5,Strange Days (1995),Action|Crime|Drama|Mystery|Sci-Fi|Thriller
7,203,4.0,"To Wong Foo, Thanks for Everything! Julie Newm...",Comedy
8,219,4.5,"Cure, The (1995)",Drama
9,233,4.0,Exotica (1994),Drama


In [191]:
moviename_rating[(moviename_rating['title'] == 'When Night Is Falling (1995)') | (moviename_rating['title'] == 'Carrington (1995)')]

Unnamed: 0,movieId,rating,title,genres
0,35,4.5,Carrington (1995),Drama|Romance
1,49,4.5,When Night Is Falling (1995),Drama|Romance


In [210]:
moviename_rating['title'] = moviename_rating['title'].apply(lambda title:title[:-7].lower())

Unnamed: 0,movieId,rating,title,genres
264,39183,3.723214,brokeback mountain,Drama|Romance


In [214]:
nameList = ['Anatomy of Hell','being julia', 'brokeback mountain','call me by your name','Carol','Christopher and his Kind',
           'Fabulous! The Story of Queer Cinema','Fellini Satyricon','Fight Club','Henry & June','I Now Pronounce You Chuck and Larry',
           'I Spit on Your Grave','Iron Jawed Angels',"Jennifer's Body",'Lawrence of Arabia','Merry Christmas, Mr. Lawrence',
           "Miller's Crossing",'Mona Lisa Smile','My Beautiful Laundrette','My Own Private Idaho','Naked Vengeance',"Naomi and Ely's No Kiss List",
           'Pink Narcissus','The Ballad of Josie','The Hunger Games- Mockingjay','The love witch','The Picture of Dorian Gray',
           'The Wedding Banquet','Waiting for Guffman']
nameList2 = [x.lower() for x in nameList]
sampleName = moviename_rating[moviename_rating['title'].str.contains('|'.join(nameList2), na=False)]

In [215]:
sampleName

Unnamed: 0,movieId,rating,title,genres
50,1611,3.0,my own private idaho,Drama|Romance
99,2959,5.0,fight club,Action|Crime|Drama|Thriller
154,5367,3.9,my beautiful laundrette,Drama|Romance
198,7154,3.5,mona lisa smile,Drama|Romance
206,7487,3.0,henry & june,Drama
264,39183,3.723214,brokeback mountain,Drama|Romance
302,54004,2.375,i now pronounce you chuck and larry,Comedy|Romance
362,71205,3.3,jennifer's body,Comedy|Horror|Sci-Fi|Thriller


In [218]:
# Add movie ratings for the rest of sample movies selected
sampleName.loc[8] = ['x',2.25, 'anatomy of hell', 'x']



In [234]:
rating = {'Anatomy of Hell':2.25, 'being julia':3.5,'call me by your name':3.95,'Carol': 3.6,'Christopher and his Kind':3.5,
         'Fabulous! The Story of Queer Cinema':3.45, 'Fellini Satyricon':3.5,
           'I Spit on Your Grave':3.15,'Iron Jawed Angels':3.8,'Lawrence of Arabia':4.15,'Merry Christmas, Mr. Lawrence':3.65,
           "Miller's Crossing":3.9,'Naked Vengeance':3.1,"Naomi and Ely's No Kiss List":2.8,
           'Pink Narcissus':3.35,'The Ballad of Josie':2.9,'The Hunger Games- Mockingjay':3.35,'The love witch':3.1,
           'The Picture of Dorian Gray':3.8,'The Wedding Banquet':3.85,'Waiting for Guffman':3.8 }

temp_name = ['Anatomy of Hell','being julia', 'call me by your name','Carol','Christopher and his Kind',
           'Fabulous! The Story of Queer Cinema','Fellini Satyricon',
           'I Spit on Your Grave','Iron Jawed Angels','Lawrence of Arabia','Merry Christmas, Mr. Lawrence',
           "Miller's Crossing",'Naked Vengeance',"Naomi and Ely's No Kiss List",
           'Pink Narcissus','The Ballad of Josie','The Hunger Games- Mockingjay','The love witch','The Picture of Dorian Gray',
           'The Wedding Banquet','Waiting for Guffman']

for i in range(21):
    sampleName.loc[i] = ['x',rating[temp_name[i]], temp_name[i], 'x']



In [240]:
# Normalize rating from the 0-5 scale to [-1, 1] so it can be compared with polarity from sentiment analysis
sampleName['rating'] = sampleName['rating'].apply(lambda x:(x*(2/5)-1))
sampleName



Unnamed: 0,movieId,rating,title,genres
50,1611,0.2,my own private idaho,Drama|Romance
99,2959,1.0,fight club,Action|Crime|Drama|Thriller
154,5367,0.56,my beautiful laundrette,Drama|Romance
198,7154,0.4,mona lisa smile,Drama|Romance
206,7487,0.2,henry & june,Drama
264,39183,0.489286,brokeback mountain,Drama|Romance
302,54004,-0.05,i now pronounce you chuck and larry,Comedy|Romance
362,71205,0.32,jennifer's body,Comedy|Horror|Sci-Fi|Thriller
8,x,0.52,Iron Jawed Angels,x
9,x,0.66,Lawrence of Arabia,x
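
The normalization used above is the linear map rating × 2/5 − 1, which sends 0 to -1, 2.5 to 0 and 5 to 1. A small sketch makes the mapping explicit; `normalize_rating` is a hypothetical helper name, and the sample values are ratings that appear in the tables above.

```python
def normalize_rating(rating, max_rating=5.0):
    """Map a rating on the 0..max_rating scale linearly onto -1..1."""
    return rating * (2 / max_rating) - 1


# Ratings taken from the tables above
for value in [3.8, 4.15, 2.375, 5.0]:
    print(value, round(normalize_rating(value), 6))
```

Since MovieLens ratings start at 0.5 stars, the lowest value this map can actually produce for these data is -0.8 rather than -1, so the two scales are comparable but not perfectly aligned at the low end.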


## Tweets
The Twitter API is used to collect people's reviews of the movies. Because free-tier users of the Twitter API face a search limitation (access only to tweets from the past 7 days), I use XPath to collect the tweet IDs for the movies I'm interested in and then use the API to fetch each tweet's content by ID. The detailed steps are as follows:

1. For each movie, search movie names in Twitter.
2. Use XPath helper to get tweet id on the search result page, and save them in txt files.
3. Use Twitter API to extract content of each tweet id.

Step 1 involves manually typing the movie name, which has to be done for each movie. The movie list from the previous steps has over 500 entries. In this project, I will search manually for 50 movies. In the future, I will explore more efficient ways to get tweets online.

<br>
The following code blocks show the way of getting tweet content and storing them into dataframe for further analysis on 2 movies. 
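
Step 2 leaves one tweet reference per line in a text file, and the ID is taken as the last path segment of each line. The sketch below shows a slightly more defensive version of that extraction. It assumes the lines are tweet URLs ending in the numeric status ID; the helper name `extract_tweet_id` and the sample URLs are hypothetical.

```python
def extract_tweet_id(line):
    """Return the last path segment of a tweet URL, or None if it is not numeric."""
    candidate = line.strip().rstrip("/").split("/")[-1].split("?")[0]
    return candidate if candidate.isdigit() else None


# Hypothetical lines, for illustration only
lines = [
    "https://twitter.com/example_user/status/1234567890\n",
    "https://twitter.com/example_user/status/1234567890?s=20",
    "",
]
id_list = [tweet_id for tweet_id in map(extract_tweet_id, lines) if tweet_id is not None]
print(id_list)
```

Filtering out empty or malformed lines up front avoids spending API calls on IDs that cannot resolve.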

In [34]:
# authorization
# Credentials are read from environment variables instead of being hard-coded
# in the notebook (the variable names below are an assumption, not a convention)
import os

API_KEY = os.environ['TWITTER_API_KEY']
API_SECRET = os.environ['TWITTER_API_SECRET']

ACCESS_TOKEN = os.environ['TWITTER_ACCESS_TOKEN']
ACCESS_TOKEN_SECRET = os.environ['TWITTER_ACCESS_TOKEN_SECRET']

auth = tweepy.OAuthHandler(API_KEY, API_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

# Create tweepy object for twitter API
api = tweepy.API(auth)

In [35]:
# Read from Twitter ID file
tweet_summary_map = {}
nameList = ['Anatomy of Hell','being julia', 'brokeback mountain','call me by your name','Carol','Christopher and his Kind',
           'Fabulous! The Story of Queer Cinema','Fellini Satyricon','Fight Club','Henry & June','I Now Pronounce You Chuck and Larry',
           'I Spit on Your Grave','Iron Jawed Angels',"Jennifer's Body",'Lawrence of Arabia','Merry Christmas, Mr. Lawrence',
           "Miller's Crossing",'Mona Lisa Smile','My Beautiful Laundrette','My Own Private Idaho','Naked Vengeance',"Naomi and Ely's No Kiss List",
           'Pink Narcissus','The Ballad of Josie','The Hunger Games- Mockingjay','The love witch','The Picture of Dorian Gray',
           'The Wedding Banquet','Waiting for Guffman']
for movie in nameList:
    with open("Tweet Data/Tweepy-API-xPath/"+movie+".txt", 'r') as f:
        x = f.read().splitlines()
    id_list = [line.split('/')[-1] for line in x]
    tweet_summary = pd.DataFrame(columns=['Timezone', 'Full Tweet', 'user_name', 'user_location', 'coordinates', 'country_code', 'place'])
    tweet_summary.index.name = 'Tweet Time'
    for tweet_id in tqdm_notebook(id_list):
        try:
            tweet_info = api.get_status(tweet_id, lang = 'en', tweet_mode='extended')
        except Exception:
            # Skip tweets that cannot be fetched (deleted, protected, rate-limited, ...)
            # so that stale values from the previous tweet are never stored
            continue
        # Retweets carry the untruncated text on the original status
        if hasattr(tweet_info, 'retweeted_status'):
            tweet = tweet_info.retweeted_status.full_text
        else:
            tweet = tweet_info.full_text
        if tweet_info.place:
            place = tweet_info.place.full_name
            country_code = tweet_info.place.country_code
        else:
            place = None
            country_code = None
        tweet_summary.loc[tweet_info.created_at] = [tweet_info.user.time_zone, tweet, tweet_info.user.name, tweet_info.user.location, tweet_info.coordinates, country_code, place]
    tweet_summary_map[movie] = tweet_summary
    time.sleep(1)

In [82]:
# tweets with user locations
collect = []
for movie in nameList:
    summary = tweet_summary_map[movie]
    # keep only tweets whose author has a profile location
    collect.append(summary[summary['user_location'].notnull()])
    
twitter_loc = pd.concat(collect)
twitter_loc.to_csv("twitter_loc.csv")

In [84]:
# After hand-cleaning the twitter_loc file, count the number of tweets per state
twitter_state = pd.read_csv("twitter_loc_v3.csv")

In [87]:
# Number of tweets per state
twitter_stateNum = twitter_state.groupby(['State']).size()

In [88]:
# Output tweets by state file
# twitter_stateNum.to_csv("twitter_stateNum.csv")

In [267]:
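def twitter_polarity(movie_name): 
    """Return the mean TextBlob polarity of a movie's geotagged tweets, grouped by state."""
    test1 = tweet_summary_map[movie_name][['Full Tweet','place']]
    # .copy() avoids SettingWithCopyWarning when new columns are added below
    test1_filtered = test1[test1['place'].notnull()].copy()
    # The last two characters of the place name are taken as the state code
    test1_filtered['State'] = test1_filtered['place'].apply(lambda state: state[-2:])
    test1_filtered['polarity'] = test1_filtered['Full Tweet'].apply(lambda tweet:TextBlob(tweet).sentiment.polarity)
    try:
        test1_filtered = test1_filtered.groupby('State')[['polarity']].mean()
        test1_filtered['movie'] = movie_name
    except Exception:
        # Leave the frame ungrouped if aggregation fails
        pass
    return test1_filtered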

In [276]:
nameList = ['Anatomy of Hell','being julia', 'brokeback mountain','call me by your name','Carol','Christopher and his Kind',
           'Fabulous! The Story of Queer Cinema','Fellini Satyricon','Fight Club','Henry & June','I Now Pronounce You Chuck and Larry',
           'I Spit on Your Grave','Iron Jawed Angels',"Jennifer's Body",'Lawrence of Arabia','Merry Christmas, Mr. Lawrence',
           "Miller's Crossing",'Mona Lisa Smile','My Beautiful Laundrette','My Own Private Idaho','Naked Vengeance',"Naomi and Ely's No Kiss List",
           'Pink Narcissus','The Ballad of Josie','The Hunger Games- Mockingjay','The love witch','The Picture of Dorian Gray',
           'The Wedding Banquet','Waiting for Guffman']
collection = []
for movie in nameList:
    collection.append(twitter_polarity(movie))

polarity_tweet = pd.concat(collection)[['movie','polarity']]



In [284]:
polarity_tweet = polarity_tweet.reset_index()

In [288]:
polarity_tweet['movie'] = polarity_tweet['movie'].apply(lambda x:x.lower())

In [290]:
sampleName['title'] = sampleName['title'].apply(lambda x:x.lower())



In [291]:
polarity_rating = polarity_tweet.merge(sampleName, left_on='movie', right_on='title')

In [294]:
polarity_rating = polarity_rating[['index','movie','polarity','rating']]

In [296]:
polarity_rating['diff'] = polarity_rating['polarity'] - polarity_rating['rating']



In [297]:
polarity_rating

Unnamed: 0,index,movie,polarity,rating,diff
0,CA,being julia,0.000000,0.400000,-0.400000
1,NY,being julia,0.068182,0.400000,-0.331818
2,TX,being julia,0.000000,0.400000,-0.400000
3,as,being julia,0.426667,0.400000,0.026667
4,CA,brokeback mountain,0.100000,0.489286,-0.389286
5,IN,brokeback mountain,-0.350000,0.489286,-0.839286
6,NY,brokeback mountain,0.266667,0.489286,-0.222619
7,RI,brokeback mountain,0.126667,0.489286,-0.362619
8,SA,brokeback mountain,0.200000,0.489286,-0.289286
9,VA,brokeback mountain,0.000000,0.489286,-0.489286


In [298]:
states = ['AK','AL','AR','AZ','CA','CO','CT','DC','DE','FL','GA','HI','IA','ID','IL','IN',
          'KS','KY','LA','MA','MD','ME','MI','MN','MO','MS','MT','NC','ND','NE','NH','NJ',
          'NM','NV','NY','OH','OK','OR','PA','RI','SC','SD','TN','TX','UT','VA','VT','WA',
          'WI','WV','WY']
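
Values such as `SA` and `as` in the table above show that the last two characters of a place name are not always a state code. One way to be stricter is to accept a code only when the name has the form "City, ST" and the code is in the list above. The sketch below assumes that format; `extract_state` is a hypothetical helper and the sample place names are invented for illustration.

```python
US_STATES = {
    'AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA', 'HI', 'IA', 'ID', 'IL', 'IN',
    'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ',
    'NM', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VA', 'VT', 'WA',
    'WI', 'WV', 'WY',
}


def extract_state(place_full_name):
    """Return the state code from a 'City, ST' place name, or None if it is not one."""
    if not isinstance(place_full_name, str):
        return None
    parts = place_full_name.strip().rsplit(", ", 1)
    if len(parts) == 2 and parts[1] in US_STATES:
        return parts[1]
    return None


# Hypothetical place names, for illustration only
for name in ["Chicago, IL", "California, USA", "Texas", None]:
    print(name, "->", extract_state(name))
```

Applying a check like this when the `State` column is built would make the later filtering step unnecessary, at the cost of dropping places that are given only as a state or country name.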

In [299]:
# isin() does an exact match on the state code instead of a regex substring search
real_states = polarity_rating[polarity_rating['index'].isin(states)]


In [300]:
real_states


Unnamed: 0,index,movie,polarity,rating,diff
0,CA,being julia,0.0,0.4,-0.4
1,NY,being julia,0.068182,0.4,-0.331818
2,TX,being julia,0.0,0.4,-0.4
4,CA,brokeback mountain,0.1,0.489286,-0.389286
5,IN,brokeback mountain,-0.35,0.489286,-0.839286
6,NY,brokeback mountain,0.266667,0.489286,-0.222619
7,RI,brokeback mountain,0.126667,0.489286,-0.362619
9,VA,brokeback mountain,0.0,0.489286,-0.489286
11,IL,call me by your name,-0.075,0.58,-0.655
12,NY,call me by your name,-0.0625,0.58,-0.6425
