# Social media analysis of Movies 

This project models a database around the movies domain. I scraped movie details (such as name, description, genre and actors) from IMDB for movies released in 2018.
I will integrate this with Twitter and Facebook data for tagging, sentiment analysis, engagement analysis, reach analysis and personalized recommendation, in order to answer questions about users, posts and movies using text mining and Natural Language Processing.

In [1]:
#importing dependencies
import pandas as pd
import numpy as np
import sqlite3

## Datasets
-imdb_movie_dataset.csv has the movie titles, genres, actors, movie descriptions and ratings. This was scraped from the IMDB website.

-post_data_fb.csv has the posts retrieved from Facebook using the API.

-comments_posts_fb.csv has the comments on the posts above, retrieved using the Facebook API.

-Tweets.csv has the tweets and the Twitter user details, retrieved using the Twitter API.

-Movie_Tweets.csv has all the tweets retrieved for a particular movie, consolidated into a single string.

The script that collects tweets and posts from Twitter and Facebook is hosted on AWS so that it can run continuously and gather a large amount of data. The script writes the data to the CSVs.

Facebook and Twitter give us different kinds of data, so I plan on making use of this difference to get answers to a variety of questions, using either one or both of the sources combined.

# Tagging

I have used Tf-idf to determine which words are used frequently when talking about a certain movie but rarely when talking about the others. To do this, I compared the consolidated tweets of the different movies.

In [None]:
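import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from textblob import TextBlob as tb #assumption: tb is TextBlob, since the cells below rely on blob.words
import string
import math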

In [None]:
#Each blob is a document of consolidated tweets of a movie.
#bloblist is a list of all 35 blobs corresponding to the 35 movies

movie_tweets=pd.read_csv("Movie_Tweets.csv")

bloblist=[]

for tweets in movie_tweets.itertuples():
    #tweets[3] is the third column of the CSV (the consolidated tweet text)
    #Python 3 strings are already Unicode, so no .decode('utf-8') is needed
    blob=tb(str(tweets[3]))
    print(blob)
    bloblist.append(blob)

# Tf-Idf 

In [None]:
from __future__ import division

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    n=0
    for blob in bloblist:
        if word in blob.words:
            n=n+1
    return n

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)
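
To make the scoring concrete, here is a minimal, self-contained sketch of the same formulas applied to plain lists of tokens instead of TextBlob objects. The three toy documents are made up for illustration and are not taken from the dataset.

```python
import math

def tf(word, doc):
    # Term frequency: share of the document's tokens equal to `word`.
    return doc.count(word) / len(doc)

def n_containing(word, docs):
    # Number of documents that contain `word` at least once.
    return sum(1 for doc in docs if word in doc)

def idf(word, docs):
    # Inverse document frequency with the same +1 smoothing as above.
    return math.log(len(docs) / (1 + n_containing(word, docs)))

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# Hypothetical token lists, one per movie.
docs = [
    ["wakanda", "forever", "movie", "wakanda"],
    ["maze", "runner", "movie"],
    ["early", "man", "movie"],
]

rare = tfidf("wakanda", docs[0], docs)   # appears in one document only
common = tfidf("movie", docs[0], docs)   # appears in every document
print(rare > 0, common < 0)
```

With the +1 smoothing, a word that occurs in every document gets a negative score, so generic words such as "movie" sink to the bottom of the ranking while movie-specific words rise to the top.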



In [None]:
#Generating scores for all words in the tweets, after removing stopwords and punctuation
punctuation = list(string.punctuation)
stop = stopwords.words('english') + punctuation + ['rt','RT','via']

#NOTE: movie_ids must hold the movie IDs in the same order as bloblist; it is not defined in the cells shown here
rows=[]
for movie_id, blob in zip(movie_ids, bloblist):
    for word in blob.words:
        if word not in stop and word.isalpha():
            rows.append([movie_id,word,tfidf(word, blob, bloblist)])
scores=pd.DataFrame(rows,columns=['movie_id','word','score'])

#Sort the words by score and keep only the 5 highest-scoring words for each movie.
#These 5 most frequent and RELEVANT words are the tags for that movie
scores=scores.sort_values(by=['score'],ascending=False)
scores=scores.drop_duplicates(keep='first')
scores=scores.groupby('movie_id').head(5).reset_index(drop=True)

In [2]:
#reading the movie dataset
movie_data = pd.read_csv("imdb_movie_dataset.csv")

In [3]:
#reading post dataset
post_data = pd.read_csv("post_data_fb.csv")
post_data = post_data.drop(['Unnamed: 0'], axis = 1)

In [4]:
#reading comment dataset
comment_data = pd.read_csv("comments_posts_fb.csv")
comment_data = comment_data.drop(['Unnamed: 0'], axis = 1)

In [42]:
#reading tweets database
tweets_data = pd.read_csv("Tweets.csv")

In [59]:
#reading the scores
tag_scores = pd.read_csv("scores.csv")
tag_scores = tag_scores[['movie_id', 'word', 'score']]
tag_scores.head()

Unnamed: 0,movie_id,word,score
0,173003000000000.0,,2.456736
1,126481000000000.0,,2.456736
2,372742000000000.0,mazerunnermovie,0.301284
3,372742000000000.0,dylanobrien,0.301284
4,372742000000000.0,sangsterthomas,0.301284


## Creating master tables and normalizing the data to third normal form


In [6]:
#main table - All the other tables will be connected to this one
movie_master_table = pd.DataFrame()
movie_master_table[['movie_id','movie_names','movie_description','imdb_ratings','metascores','runtime','gross_value','year_release']] = movie_data[['Movie_id','movie_names','movie_description','imdb_ratings','metscores','runtime','gross_value','year_release']]
movie_master_table = movie_master_table.drop_duplicates()
movie_master_table.head(3)

Unnamed: 0,movie_id,movie_names,movie_description,imdb_ratings,metascores,runtime,gross_value,year_release
0,635054000000000.0,Black Panther,"[""T'Challa, the King of Wakanda, rises to the ...",7.9,88,134,291954422.0,2018
16,1071120000000000.0,The Cloverfield Paradox,"['Orbiting a planet on the brink of war, scien...",5.7,37,102,,2018
32,372742000000000.0,Maze Runner: The Death Cure,['Young hero Thomas embarks on a mission to fi...,6.8,51,141,55366604.0,2018


In [7]:
#starcast main table - this will connect to movie table using a separate relational table
starcast_master_table = pd.DataFrame()
starcast_master_table[['starcast_id','starcast_name']] = movie_data[['star_cast_id','star_cast']]
starcast_master_table = starcast_master_table.drop_duplicates()
starcast_master_table.head(3)

Unnamed: 0,starcast_id,starcast_name
0,10676,Ryan Coogler
4,10677,Chadwick Boseman
8,10678,Michael B. Jordan


In [8]:
#director main table - this will connect to movie table using a separate relational table
director_master_table = pd.DataFrame()
director_master_table[['director_id','director_name']] = movie_data[['director_id','director_name']]
director_master_table = director_master_table.drop_duplicates()
director_master_table.head(3)

Unnamed: 0,director_id,director_name
0,1350,Ryan Coogler
16,1351,Julius Onah
32,1352,Wes Ball


In [9]:
#genre main table - this will connect to movie table using a separate relational table
genre_master_table = pd.DataFrame()
genre_master_table[['genre_id','genre']] = movie_data[['genre_id','genre']]
genre_master_table = genre_master_table.drop_duplicates()
genre_master_table.head(3)

Unnamed: 0,genre_id,genre
0,100,Action
1,104,Adventure
2,102,Sci


In [10]:
#posts main table - this will connect to movie table using a separate relational table (these are posts from Facebook)
posts_master_table = pd.DataFrame()
posts_master_table[['post_id','created_at','post_message','likes_count','share_count','comments_count','user_engagement']] = post_data[['post_id','created_at','post_message','post_likes_count','post_shares_count','post_comment_count','user_engagement']]
posts_master_table = posts_master_table.drop_duplicates()
posts_master_table['created_at'] = posts_master_table['created_at'].astype('datetime64[ns]')
posts_master_table.head(3)

Unnamed: 0,post_id,created_at,post_message,likes_count,share_count,comments_count,user_engagement
0,458711740828112_1857840717581867,2018-03-12 20:00:00,,6,4.0,2,12
1,458711740828112_1858124704220135,2018-03-12 19:45:00,,7,4.0,0,11
2,458711740828112_1857840970915175,2018-03-12 19:30:00,,17,5.0,0,22


In [11]:
#comments main table - this will connect to posts table using a separate relational table
comments_master_table = pd.DataFrame()
comments_master_table[['comments_id','created_at','comments']] = comment_data[['comments_id','created_at','post_comments']]
comments_master_table = comments_master_table.drop_duplicates()
comments_master_table['created_at'] = comments_master_table['created_at'].astype('datetime64[ns]')
comments_master_table.head(3)

Unnamed: 0,comments_id,created_at,comments
0,1857840717581867_1858341727531766,2018-03-12 20:06:32,
1,1857840717581867_1858346307531308,2018-03-12 20:11:39,"Vero grazie, buonanotte"
2,1857951664237439_1858307817535157,2018-03-12 19:21:25,Mia


In [12]:
#tweets main table - this will connect the tweets to the movies
tweets_master_table = pd.DataFrame()
tweets_master_table[['tweet_id','tweet_text', 'created_date', 'retweet_count']] = tweets_data[['tweet_id','tweet_text','created_date','retweet_count']]
tweets_master_table = tweets_master_table.drop_duplicates()
tweets_master_table.head(3)

Unnamed: 0,tweet_id,tweet_text,created_date,retweet_count
0,1111,Just saw #BlackPanther and now I keep randomly...,Sat Mar 17 20:41:14 +0000 2018,0
1,1112,RT @WakaFlocka: We Got Him OUT!!! #DanielKaluu...,Sat Mar 17 20:41:13 +0000 2018,96
2,1113,RT @GeeksOfColor: The Dora Milaje Take Center ...,Sat Mar 17 20:41:00 +0000 2018,281


In [44]:
#twitter user table - this links the user with tweets
twitter_user_table = pd.DataFrame()
twitter_user_table[['user_id','user_name']] = tweets_data[['user_id','screen_name']]
twitter_user_table = twitter_user_table.drop_duplicates()
twitter_user_table.head(3)

Unnamed: 0,user_id,user_name
0,121211,happyhealthyacw
1,121212,ClintonS_anchez
2,121213,blackaqualad


## Creating Mappings between all the tables

Here, we create separate mapping tables between the database tables so that they can be combined with join statements.
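
As a rough illustration of why the mapping tables are needed, the sketch below builds a tiny in-memory SQLite database with a movie table, a genre table and a mapping table between them, then joins through the mapping table. The table and column names mirror the ones used in this notebook, but the short movie IDs and the genre assignments for the second movie are placeholders, not real values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movie_master_table(movie_id INTEGER PRIMARY KEY, movie_names TEXT);
CREATE TABLE genre_master_table(genre_id INTEGER PRIMARY KEY, genre TEXT);
CREATE TABLE movie_genre_maping(movie_id INTEGER, genre_id INTEGER);
""")

# Placeholder movie IDs (1, 2) stand in for the long IDs in the dataset.
conn.executemany("INSERT INTO movie_master_table VALUES (?, ?)",
                 [(1, "Black Panther"), (2, "Maze Runner: The Death Cure")])
conn.executemany("INSERT INTO genre_master_table VALUES (?, ?)",
                 [(100, "Action"), (104, "Adventure"), (102, "Sci")])
# One row per (movie, genre) pair; the pairs for movie 2 are hypothetical.
conn.executemany("INSERT INTO movie_genre_maping VALUES (?, ?)",
                 [(1, 100), (1, 104), (1, 102), (2, 100), (2, 102)])

# The mapping table resolves the many-to-many relationship in the join.
rows = conn.execute("""
SELECT mmt.movie_names, gmt.genre
FROM movie_master_table mmt
JOIN movie_genre_maping mgm ON mmt.movie_id = mgm.movie_id
JOIN genre_master_table gmt ON mgm.genre_id = gmt.genre_id
ORDER BY mmt.movie_names, gmt.genre
""").fetchall()

for movie, genre in rows:
    print(movie, "|", genre)
```

Each movie and each genre is stored once, and only the mapping table grows when a movie belongs to several genres.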


In [15]:
#movie-director mapping
movie_director_maping = pd.DataFrame()
movie_director_maping[['movie_id','director_id']] = movie_data[['Movie_id','director_id']]
movie_director_maping = movie_director_maping.drop_duplicates()
movie_director_maping.head(3)

Unnamed: 0,movie_id,director_id
0,635054000000000.0,1350
16,1071120000000000.0,1351
32,372742000000000.0,1352


In [16]:
#movie-starcast mapping
movie_starcast_maping = pd.DataFrame()
movie_starcast_maping[['movie_id','starcast_id']] = movie_data[['Movie_id','star_cast_id']]
movie_starcast_maping = movie_starcast_maping.drop_duplicates()
movie_starcast_maping.head(3)

Unnamed: 0,movie_id,starcast_id
0,635054000000000.0,10676
4,635054000000000.0,10677
8,635054000000000.0,10678


In [17]:
#movie-genre mapping
movie_genre_maping = pd.DataFrame()
movie_genre_maping[['movie_id','genre_id']] = movie_data[['Movie_id','genre_id']]
movie_genre_maping = movie_genre_maping.drop_duplicates()
movie_genre_maping.head(3)

Unnamed: 0,movie_id,genre_id
0,635054000000000.0,100
1,635054000000000.0,104
2,635054000000000.0,102


In [18]:
#movie-post mapping
movie_post_maping = pd.DataFrame()
movie_post_maping[['movie_id','post_id']] = post_data[['movie_id','post_id']]
movie_post_maping = movie_post_maping.drop_duplicates()
movie_post_maping.head(3)

Unnamed: 0,movie_id,post_id
0,635000000000000.0,458711740828112_1857840717581867
1,635000000000000.0,458711740828112_1858124704220135
2,635000000000000.0,458711740828112_1857840970915175


In [19]:
#post-comment mapping
post_comment_maping = pd.DataFrame()
post_comment_maping[['post_id','comments_id']] = comment_data[['post_id','comments_id']]
post_comment_maping = post_comment_maping.drop_duplicates()
post_comment_maping.head(3)

Unnamed: 0,post_id,comments_id
0,458711740828112_1857840717581867,1857840717581867_1858341727531766
1,458711740828112_1857840717581867,1857840717581867_1858346307531308
2,458711740828112_1857951664237439,1857951664237439_1858307817535157


In [20]:
#movie-tweets mapping
movie_tweets_mapping = pd.DataFrame()
movie_tweets_mapping[['movie_id','tweet_id']] = tweets_data[['movie_id','tweet_id']]
movie_tweets_mapping = movie_tweets_mapping.drop_duplicates()
movie_tweets_mapping.head(3)

Unnamed: 0,movie_id,tweet_id
0,635000000000000.0,1111
1,635000000000000.0,1112
2,635000000000000.0,1113


In [43]:
#tweets-user mapping
tweets_user_mapping = pd.DataFrame()
tweets_user_mapping[['tweet_id', 'user_id']] = tweets_data[['tweet_id','user_id']]
tweets_user_mapping = tweets_user_mapping.drop_duplicates()
tweets_user_mapping.head(3)

Unnamed: 0,tweet_id,user_id
0,1111,121211
1,1112,121212
2,1113,121213


## Creating databases

Now that we have our schema and the tables for the database ready, let us create the tables in the database and store the values in them.

In [22]:
#creating connections
conn = sqlite3.connect("imdb_movie.db") #creates and connects to a database named "imdb_movie"
c = conn.cursor()

In [24]:
c.execute("""DROP TABLE IF EXISTS movie_master_table""") #IF EXISTS avoids an error when the table does not exist yet
c.execute("""CREATE TABLE movie_master_table(
movie_id INTEGER PRIMARY KEY,
movie_names CHAR(50),
movie_description VARCHAR(200),
imdb_ratings FLOAT,
metascores INTEGER,
runtime INTEGER,
gross_value INTEGER,
year_release CHAR(4));""") #creating a new table within the database

<sqlite3.Cursor at 0x22342a34880>

In [26]:
movie_master_table.to_sql("movie_master_table", conn, if_exists = 'append', index = False) #Storing data frame to SQL database

In [27]:
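#the table is named starcast_master_table so that it matches the to_sql call that fills it
c.execute("""DROP TABLE IF EXISTS starcast_master_table;""")
c.execute("""CREATE TABLE starcast_master_table(
starcast_id INTEGER PRIMARY KEY,
starcast_name CHAR NOT NULL);""") #creating a new table within the database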

<sqlite3.Cursor at 0x22342a34880>

In [28]:
starcast_master_table.to_sql("starcast_master_table", conn, if_exists = 'append', index = False) #Storing data frame to SQL database

In [29]:
c.execute("""DROP TABLE IF EXISTS director_master_table;""")
c.execute("""CREATE TABLE director_master_table(
director_id INTEGER PRIMARY KEY,
director_name CHAR NOT NULL);""") #creating a new table within the database

<sqlite3.Cursor at 0x22342a34880>

In [30]:
director_master_table.to_sql("director_master_table", conn, if_exists = 'append', index = False) #Storing data frame to SQL database

In [31]:
c.execute("""DROP TABLE IF EXISTS genre_master_table;""")
c.execute("""CREATE TABLE genre_master_table(
genre_id INTEGER PRIMARY KEY,
genre CHAR NOT NULL);""") #creating a new table within the database

<sqlite3.Cursor at 0x22342a34880>

In [32]:
genre_master_table.to_sql("genre_master_table", conn, if_exists = 'append', index = False) #Storing data frame to SQL database

In [33]:
c.execute("""DROP TABLE IF EXISTS posts_master_table;""")
c.execute("""CREATE TABLE posts_master_table(
post_id CHAR PRIMARY KEY,
created_at DATE,
post_message CHAR,
likes_count INTEGER,
share_count INTEGER,
comments_count INTEGER,
user_engagement INTEGER) ;""") #creating a new table within the database

<sqlite3.Cursor at 0x22342a34880>

In [34]:
posts_master_table.to_sql("posts_master_table", conn, if_exists = 'append', index = False) #Storing data frame to SQL database

In [35]:
c.execute("""DROP TABLE IF EXISTS comments_master_table;""")
c.execute("""CREATE TABLE comments_master_table(
comments_id CHAR PRIMARY KEY,
created_at DATE,
comments CHAR);""") #creating a new table within the database

<sqlite3.Cursor at 0x22342a34880>

In [36]:
comments_master_table.to_sql("comments_master_table", conn, if_exists = 'append', index = False) #Storing data frame to SQL database

In [37]:
c.execute("""DROP TABLE IF EXISTS tweets_master_table;""")
c.execute("""CREATE TABLE tweets_master_table(
tweet_id INTEGER PRIMARY KEY,
tweet_text CHAR,
created_date DATE,
retweet_count INTEGER);""") #creating a new table within the database

<sqlite3.Cursor at 0x22342a34880>

In [38]:
tweets_master_table.to_sql("tweets_master_table", conn, if_exists = 'append', index = False) #Storing data frame to SQL database

In [45]:
c.execute("""DROP TABLE IF EXISTS twitter_user_table;""")
c.execute("""CREATE TABLE twitter_user_table(
user_id INTEGER PRIMARY KEY,
user_name CHAR);""") #creating a new table within the database

<sqlite3.Cursor at 0x22342a34880>

In [46]:
twitter_user_table.to_sql("twitter_user_table", conn, if_exists = 'append', index = False) #Storing data frame to SQL database

In [66]:
c.execute("""DROP TABLE IF EXISTS movie_tags;""")
c.execute("""CREATE TABLE movie_tags(
movie_id INTEGER REFERENCES movie_master_table(movie_id),
word CHAR,
score FLOAT);""") #creating a new table within the database

<sqlite3.Cursor at 0x22342a34880>

In [67]:
tag_scores.to_sql("movie_tags", conn, if_exists = "append", index = False) #storing dataframe to SQL database

## Creating the mapping tables in the database

Here, we use the SQLite database management system to create, store and analyse our data. We now create the mapping tables of the relational schema and store our data in them.
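
SQLite only checks foreign keys when they are declared with `REFERENCES` and enforcement is switched on for the connection. The sketch below, on a throwaway in-memory database, shows one way a mapping table could declare and enforce them. The movie ID `1` and the director ID `9999` are placeholders chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default

conn.execute("""CREATE TABLE movie_master_table(
    movie_id INTEGER PRIMARY KEY, movie_names TEXT)""")
conn.execute("""CREATE TABLE director_master_table(
    director_id INTEGER PRIMARY KEY, director_name TEXT NOT NULL)""")
conn.execute("""CREATE TABLE movie_director_maping(
    movie_id INTEGER REFERENCES movie_master_table(movie_id),
    director_id INTEGER REFERENCES director_master_table(director_id))""")

conn.execute("INSERT INTO movie_master_table VALUES (1, 'Black Panther')")
conn.execute("INSERT INTO director_master_table VALUES (1350, 'Ryan Coogler')")
conn.execute("INSERT INTO movie_director_maping VALUES (1, 1350)")  # valid pair

# A mapping row that points at a non-existent director should be refused.
try:
    conn.execute("INSERT INTO movie_director_maping VALUES (1, 9999)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print("dangling director_id rejected:", rejected)
```

Without the `PRAGMA`, the same insert would succeed silently, so the declaration alone documents the relationship but does not protect the data.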


In [30]:
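#REFERENCES declares the foreign keys; SQLite enforces them only if PRAGMA foreign_keys = ON
c.execute("""DROP TABLE IF EXISTS movie_director_maping;""")
c.execute("""CREATE TABLE movie_director_maping(
movie_id INTEGER REFERENCES movie_master_table(movie_id),
director_id INTEGER REFERENCES director_master_table(director_id))""") #creating a new table within the database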

<sqlite3.Cursor at 0x221d3a4f0a0>

In [31]:
movie_director_maping.to_sql("movie_director_maping", conn, if_exists = 'append', index = False) #Storing data frame to SQL database

In [32]:
c.execute("""DROP TABLE IF EXISTS movie_actor_maping;""")
c.execute("""CREATE TABLE movie_actor_maping(
movie_id INTEGER REFERENCES movie_master_table(movie_id),
starcast_id INTEGER REFERENCES starcast_master_table(starcast_id))""") #creating a new table within the database

<sqlite3.Cursor at 0x221d3a4f0a0>

In [33]:
movie_starcast_maping.to_sql("movie_actor_maping", conn, if_exists = 'append', index = False) #Storing data frame to SQL database

In [34]:
c.execute("""DROP TABLE IF EXISTS movie_genre_maping;""")
c.execute("""CREATE TABLE movie_genre_maping(
movie_id INTEGER REFERENCES movie_master_table(movie_id),
genre_id INTEGER REFERENCES genre_master_table(genre_id))""") #creating a new table within the database

<sqlite3.Cursor at 0x221d3a4f0a0>

In [35]:
movie_genre_maping.to_sql("movie_genre_maping", conn, if_exists = 'append', index = False) #Storing data frame to SQL database

In [36]:
c.execute("""DROP TABLE IF EXISTS movie_post_maping;""")
c.execute("""CREATE TABLE movie_post_maping(
movie_id INTEGER REFERENCES movie_master_table(movie_id),
post_id CHAR REFERENCES posts_master_table(post_id))""") #creating a new table within the database

<sqlite3.Cursor at 0x221d3a4f0a0>

In [37]:
movie_post_maping.to_sql("movie_post_maping", conn, if_exists = 'append', index = False) #Storing data frame to SQL database

In [38]:
c.execute("""DROP TABLE IF EXISTS post_comment_maping;""")
c.execute("""CREATE TABLE post_comment_maping(
post_id CHAR REFERENCES posts_master_table(post_id),
comments_id CHAR REFERENCES comments_master_table(comments_id))""") #creating a new table within the database

<sqlite3.Cursor at 0x221d3a4f0a0>

In [39]:
post_comment_maping.to_sql("post_comment_maping", conn, if_exists = 'append', index = False) #Storing data frame to SQL database

In [47]:
c.execute("""DROP TABLE IF EXISTS movie_tweets_mapping;""")
c.execute("""CREATE TABLE movie_tweets_mapping(
movie_id INTEGER REFERENCES movie_master_table(movie_id),
tweet_id INTEGER REFERENCES tweets_master_table(tweet_id))""") #creating a new table within the database

<sqlite3.Cursor at 0x22342a34880>

In [48]:
movie_tweets_mapping.to_sql("movie_tweets_mapping", conn, if_exists = 'append', index = False) #Storing data frame to SQL database

In [49]:
c.execute("""DROP TABLE IF EXISTS tweets_user_mapping;""")
c.execute("""CREATE TABLE tweets_user_mapping(
tweet_id INTEGER REFERENCES tweets_master_table(tweet_id),
user_id INTEGER REFERENCES twitter_user_table(user_id))""") #creating a new table within the database

<sqlite3.Cursor at 0x22342a34880>

In [50]:
tweets_user_mapping.to_sql("tweets_user_mapping", conn, if_exists = 'append', index = False) #Storing data frame to SQL database

## Finding answers to the following questions using the data that we have collected and modelled.

i. What are people saying about me (somebody)?

    Retrieve the tags associated with a movie.
    
ii. How viral are my posts?
    
    Have my (Official Page of each movie) posts been shared, liked and commented on a lot?
    
iii. How much influence do my posts have?
    
    Which movie's Official Page posts have been shared the most times?
    
iv. What posts are like mine?
    
    Posts that use similar tags.
    
v. What users post like me?

    Posts that use similar tags and post about similar topics.

vi. Who should I be following?
    
    Who is posting a lot with the popular tags associated with me (Movie page)? This will increase my user engagement.

vii. What topics are trending in my domain?
       
    What are the most popular terms that people have used in tweets? 
       
viii. What keywords/ hashtags should I add to my post?
    
    What am I posting about? What are the popular tags related to that topic?
    
ix. Should I follow somebody back?
    
    Are we posting about similar topics? Do we have common friends? 

x. What is the best time to post?
    
    Calculate the user engagement, compare with posts posted at different times.
    
xi. Should I add a picture or URL to my post?

    Do posts with pictures and URLs get more user engagement?
    
xii. What’s my reach?
    
    Have users other than my friends liked/ shared/ commented on my posts?
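
As an example of how one of these questions could be turned into a query, the sketch below approaches question x (the best time to post) by averaging `user_engagement` per hour of the day. It runs on an in-memory table holding only the three sample posts shown earlier, and it assumes `created_at` is stored as `YYYY-MM-DD HH:MM:SS` text so that SQLite's `strftime` can read it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE posts_master_table(
    post_id TEXT PRIMARY KEY, created_at TEXT, user_engagement INTEGER)""")

# The three sample rows of posts_master_table from this notebook.
conn.executemany("INSERT INTO posts_master_table VALUES (?, ?, ?)", [
    ("458711740828112_1857840717581867", "2018-03-12 20:00:00", 12),
    ("458711740828112_1858124704220135", "2018-03-12 19:45:00", 11),
    ("458711740828112_1857840970915175", "2018-03-12 19:30:00", 22),
])

# Group posts by hour of day and rank the hours by average engagement.
by_hour = conn.execute("""
SELECT strftime('%H', created_at) AS hour,
       AVG(user_engagement) AS avg_engagement
FROM posts_master_table
GROUP BY hour
ORDER BY avg_engagement DESC
""").fetchall()

print(by_hour)
```

Three rows are far too few to draw a conclusion from; on the full table the same query would rank all posting hours, and a per-movie breakdown would only need an extra join through `movie_post_maping`.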

# Question 1:
## What are people saying about me?
 Retrieve the tags associated with a movie.

In [73]:
pd.read_sql_query("""SELECT movie_names, group_concat(word) 
FROM movie_tags as mt, movie_master_table as mmt 
WHERE mmt.movie_id == mt.movie_id GROUP BY movie_names""",conn)

Unnamed: 0,movie_names,group_concat(word)
0,12 Strong,"passes,carload,starlitewichita,fri,open"
1,A Futile and Stupid Gesture,"afutileandstupidgesture,peterprincipato,mikeco..."
2,Acts of Violence,"actsofviolence,brucewillis,therealmikeepps,tra..."
3,Annihilation,"annihilation,bears,dmeishappy,paddington,netflix"
4,Black Panther,"blackpanther,theblackpanther,marvelstudios,wor..."
5,Braven,"braven,umbc,towson,beat,uva"
6,Damsel,"sinazomanazo,damsel,yoh,peacehochub,creating"
7,Den of Thieves,"denofthieves,l,gerardbutler,outlaws,power"
8,"Don't Worry, He Won't Get Far on Foot",
9,Early Man,"earlyman,aardman,thesimpleparent,visa,gc"


# Question 2:

## What topics are trending?

We answer this question by ranking all the tags by their Tf-idf score, which measures how distinctive a tag is for one movie relative to the tweets about all the other movies.

In [83]:
pd.read_sql_query("""SELECT word,score
FROM movie_tags as mt ORDER BY mt.score DESC LIMIT 7
""",conn)

Unnamed: 0,word,score
0,,2.456736
1,,2.456736
2,mazerunnermovie,0.301284
3,dylanobrien,0.301284
4,sangsterthomas,0.301284
5,kickboxer,0.173467
6,sinazomanazo,0.168365


# Question 3:

## How viral are my posts?

Have my (Official Page of each movie) posts been shared, liked and commented on a lot on Facebook?
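
In the three sample rows of `posts_master_table` shown earlier, `user_engagement` equals the sum of likes, shares and comments. The notebook does not state how the column was computed, so the definition in the sketch below is an assumption inferred from those rows.

```python
# (likes_count, share_count, comments_count, user_engagement) from the sample rows
sample_posts = [
    (6, 4.0, 2, 12),
    (7, 4.0, 0, 11),
    (17, 5.0, 0, 22),
]

def engagement(likes, shares, comments):
    # Assumed definition: every like, share and comment counts once.
    return int(likes + shares + comments)

computed = [engagement(l, s, c) for l, s, c, _ in sample_posts]
stored = [row[3] for row in sample_posts]
print(computed == stored)
```

Under that assumption, the `viral` column in the query below is the total number of interactions across all posts of a movie's page.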

In [95]:
pd.read_sql_query(""" SELECT movie_names, SUM(user_engagement) as viral
FROM movie_master_table mmt
JOIN movie_post_maping mpm ON mmt.movie_id = mpm.movie_id
JOIN posts_master_table pmt ON mpm.post_id = pmt.post_id
GROUP BY mmt.movie_names
ORDER BY viral DESC;""", conn).head()

Unnamed: 0,movie_names,viral
0,Black Panther,129103
1,12 Strong,115511
2,Den of Thieves,82694
3,Winchester,66585
4,Padmaavat,61020


# Question 4:

## How much influence do my posts have?

Which movie's Official Page posts have been shared the most times?

In [94]:
pd.read_sql_query(""" SELECT movie_names, SUM(share_count) as total_shares
FROM movie_master_table mmt
JOIN movie_post_maping mpm ON mmt.movie_id = mpm.movie_id
JOIN posts_master_table pmt ON mpm.post_id = pmt.post_id
GROUP BY mmt.movie_names
ORDER BY total_shares DESC LIMIT 7;""", conn)

Unnamed: 0,movie_names,total_shares
0,12 Strong,18934
1,Black Panther,15124
2,Winchester,14338
3,Den of Thieves,11991
4,When We First Met,6967
5,Fifty Shades Freed,4803
6,Insidious: The Last Key,4205
