## Topic Distribution by Origin of Movies

In previous analyses, we studied the topic distribution by origin for popular movies and saw some interesting differences in topics. However, because plot summaries were available for only a limited number of popular movies, those results should be interpreted with caution. Here we extend the topic distribution by origin to all movies across the world since 2007.
<hr>

In [1]:
from gensim import corpora, models, similarities, matutils
import itertools
import numpy as np
import pandas as pd
from unidecode import unidecode
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


In [2]:
dropbox = "/Users/mr/Dropbox/moviemeta/"

## LDA topic distribution for IMDB data

In [3]:
imdb_lda = models.LdaModel.load(dropbox +'lda_imdb.model')
imdb_corpus = corpora.MmCorpus(dropbox +'lda_imdb.corpus')
imdb_dict = corpora.Dictionary.load(dropbox +'lda_imdb.dict')
imdb_meta_df = pd.read_csv(dropbox + 'imdb_meta_df.csv')

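# corpus2dense returns a (num_topics x num_docs) array, so transpose it to get one row per movie
imdb_topic_matrix = matutils.corpus2dense(imdb_lda[imdb_corpus], num_terms=30, num_docs=len(imdb_corpus))
imdb_topic_df = pd.DataFrame(imdb_topic_matrix.T)
# put the metadata columns (title, year) next to the 30 topic columns
imdb_topic_df = pd.concat([imdb_topic_df, imdb_meta_df], axis=1)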


We now have a dataframe with the topic distribution for every movie. There are 30 topics, in columns 0 to 29. Each value represents how prominently a topic features in a movie.
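To make the layout of this matrix concrete, here is a minimal sketch of the same sparse-to-dense expansion in plain Python. The helper `to_dense_rows` is hypothetical and only illustrates the idea; the two toy documents reuse weights shown for the first two movies in this notebook's `head()` output, and the assumption is that topics missing from a movie's sparse list are the ones that end up as 0.

```python
NUM_TOPICS = 30

def to_dense_rows(sparse_docs, num_topics=NUM_TOPICS):
    """Expand (topic_id, weight) pairs into one row of length num_topics per document.

    Topics that are absent from a document's sparse list become 0.0.
    """
    rows = []
    for doc in sparse_docs:
        row = [0.0] * num_topics
        for topic_id, weight in doc:
            row[topic_id] = weight
        rows.append(row)
    return rows

# Toy sparse input: only the topics that carry weight are listed.
sparse_docs = [
    [(5, 0.093122), (24, 0.661517)],
    [(4, 0.037337), (7, 0.800593)],
]
dense = to_dense_rows(sparse_docs)
print(len(dense), len(dense[0]))  # number of movies, number of topic columns
print(dense[0][24])               # weight of topic 24 in the first movie
```

Each row corresponds to one movie and each position to one topic, which is the shape we want before attaching the metadata columns.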

In [4]:
imdb_topic_df.head()

Unnamed: 0.1,0,1,2,3,4,5,6,7,8,9,...,23,24,25,26,27,28,29,Unnamed: 0,title,year
0,0,0,0,0,0.0,0.093122,0,0.0,0,0,...,0.0,0.661517,0.0,0.0,0.0,0,0,0,#1 Cheerleader Camp (2010) (V),2010
1,0,0,0,0,0.037337,0.0,0,0.800593,0,0,...,0.0,0.0,0.0,0.0,0.0,0,0,1,#1 Serial Killer (2013),2013
2,0,0,0,0,0.0,0.0,0,0.0,0,0,...,0.0,0.229813,0.0,0.0,0.069425,0,0,2,#1 at the Apocalypse Box Office (2015),2015
3,0,0,0,0,0.0,0.291693,0,0.290445,0,0,...,0.0,0.0,0.035631,0.030289,0.0,0,0,3,#137 (2011),2011
4,0,0,0,0,0.0,0.0,0,0.516573,0,0,...,0.223794,0.0,0.107114,0.0,0.0,0,0,4,#29 (2012),2012


In [5]:
imdb_topic_df.shape #before dropping null data

(259028, 33)

### Adding country of origin from IMDB data

In [None]:
# keep only movies released from 2007 on; copy() avoids pandas' SettingWithCopyWarning
imdb_meta_2007_2015 = imdb_meta_df[imdb_meta_df.year > 2006].copy()

with open(dropbox + "imdb/countries.list") as f:
    countries = f.readlines()

imdb_meta_2007_2015['origin'] = pd.Series(index=imdb_meta_2007_2015.index, dtype=object)
for i, movie in enumerate(countries):
    # skip entries whose title starts with a double quote
    if movie[0] == '"':
        continue
    # progress indicator
    if i % 10000 == 0:
        print(i)
    split = movie.split('\t')
    title = split[0]
    idx = imdb_meta_2007_2015[imdb_meta_2007_2015['title'] == title].index
    # if the title is in our dataframe, add the country (the last tab-separated field)
    if len(idx) > 0:
        imdb_meta_2007_2015.loc[idx[0], 'origin'] = split[-1].replace('\n', '')
imdb_meta_2007_2015.to_csv(dropbox + 'imdb_meta_2007_2015.csv')


480000
490000
...
1490000
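The loop above scans the whole dataframe once per line of `countries.list`, which is why it needs a progress counter. A possible alternative, sketched here under assumptions, is to parse the file into a dictionary first and then look titles up directly. The function name `parse_countries` and the sample lines are hypothetical (the quoted series line in particular is made up); like the loop, the sketch skips lines starting with a double quote and takes the last tab-separated field as the country.

```python
def parse_countries(lines):
    """Map each title to its country, taken from the last tab-separated field.

    Lines starting with a double quote are skipped, as in the loop above.
    If a title appears on several lines, the last line wins.
    """
    title_to_country = {}
    for line in lines:
        if not line.strip() or line.startswith('"'):
            continue
        fields = line.rstrip('\n').split('\t')
        title_to_country[fields[0]] = fields[-1]
    return title_to_country

# Hypothetical sample lines in a tab-separated layout.
sample_lines = [
    '"Some Series" (2010)\tUSA\n',
    '#1 Serial Killer (2013)\t\t\tUSA\n',
    '#137 (2011)\t\t\t\tAustralia\n',
]
lookup = parse_countries(sample_lines)

titles = ['#1 Serial Killer (2013)', '#137 (2011)', 'Not Listed (2012)']
origins = [lookup.get(title) for title in titles]
print(origins)
```

With a dictionary like this, the origin column could be filled in one pass, for example with `imdb_meta_2007_2015['title'].map(lookup)`, where titles missing from the dictionary become NaN.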

In [6]:
#load imdb_meta_2007_2015 & add origin
imdb_meta_2007_2015 = pd.read_csv(dropbox + 'imdb_meta_2007_2015.csv')
imdb_meta_2007_2015 = imdb_meta_2007_2015.set_index('Unnamed: 0')

imdb_topic_df = pd.concat([imdb_topic_df, imdb_meta_2007_2015[['origin']]], axis=1) #origin & topics merged dataframe

In [7]:
imdb_topic_df.head()

Unnamed: 0.1,0,1,2,3,4,5,6,7,8,9,...,24,25,26,27,28,29,Unnamed: 0,title,year,origin
0,0,0,0,0,0.0,0.093122,0,0.0,0,0,...,0.661517,0.0,0.0,0.0,0,0,0,#1 Cheerleader Camp (2010) (V),2010,USA
1,0,0,0,0,0.037337,0.0,0,0.800593,0,0,...,0.0,0.0,0.0,0.0,0,0,1,#1 Serial Killer (2013),2013,USA
2,0,0,0,0,0.0,0.0,0,0.0,0,0,...,0.229813,0.0,0.0,0.069425,0,0,2,#1 at the Apocalypse Box Office (2015),2015,Australia
3,0,0,0,0,0.0,0.291693,0,0.290445,0,0,...,0.0,0.035631,0.030289,0.0,0,0,3,#137 (2011),2011,Australia
4,0,0,0,0,0.0,0.0,0,0.516573,0,0,...,0.0,0.107114,0.0,0.0,0,0,4,#29 (2012),2012,Netherlands


In [8]:
imdb_topic_df.tail()

Unnamed: 0.1,0,1,2,3,4,5,6,7,8,9,...,24,25,26,27,28,29,Unnamed: 0,title,year,origin
259023,,,,,,,,,,,...,,,,,,,259023,�egar �a� gerist (1998) (TV),1998,
259024,,,,,,,,,,,...,,,,,,,259024,�etta Reddast (2013),2013,Iceland
259025,,,,,,,,,,,...,,,,,,,259025,�r�ng s�n (2005),2005,
259026,,,,,,,,,,,...,,,,,,,259026,�a go�te le ciel (2014),2014,Canada
259027,,,,,,,,,,,...,,,,,,,259027,�l (2001) (V),2001,


As the dataframe above shows, a considerable number of movies have no topic scores (no plot available) or no origin (origin is NaN). We therefore drop the rows with null data first.

In [9]:
imdb_topic_df = imdb_topic_df.dropna()
imdb_topic_df.shape

(96996, 34)
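Note that `dropna()` without arguments removes a row if any of its columns is null, so it filters on topic scores and origin at the same time. The sketch below shows this on a toy dataframe; the column name `topic_0` and all values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: row 1 lacks both a topic score and an origin, row 3 lacks only the origin.
toy = pd.DataFrame({
    'topic_0': [0.66, np.nan, 0.12, 0.30],
    'title': ['A (2010)', 'B (1998)', 'C (2013)', 'D (2005)'],
    'origin': ['USA', np.nan, 'Iceland', np.nan],
})

print(toy.isnull().sum())             # null count per column

complete = toy.dropna()               # drop rows with a null in any column
scored = toy.dropna(subset=['topic_0'])  # drop only rows without a topic score

print(complete.shape, scored.shape)
```

Checking `isnull().sum()` before dropping is a cheap way to see which column is responsible for most of the lost rows.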

In [11]:
#check the number of origins
origins = imdb_topic_df['origin'].unique()
print(len(origins))
origins[:5]

221


['Canada', 'Fiji', 'Turkmenistan', 'Saint Helena', 'Serbia and Montenegro']

In [49]:
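#mean and summed topic score for each origin & total number of movies produced in each origin
topic_score_mean = imdb_topic_df.groupby('origin').mean()
topic_score_sum = imdb_topic_df.groupby('origin').sum()
#count() returns one column per input column; column 0 is used below as the movie count
num_movies = imdb_topic_df.groupby('origin').count()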

In [23]:
topic_score_sum.head()

origin,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,Unnamed: 0,year
Afghanistan,0.803492,1.95582,1.308161,1.66369,0.967186,5.436209,1.887227,5.085461,1.226522,1.18674,...,0.907591,1.856478,3.346011,1.945502,0.483885,0.987426,3.458053,2.065014,5798408,116711
Albania,0.0,0.346023,0.238979,0.745053,0.225118,2.047592,0.549389,1.801471,0.375738,0.35179,...,0.408999,0.120434,0.989995,0.768267,0.332201,0.742103,3.022251,2.064955,2116267,46260
Algeria,0.055477,0.242904,0.242065,0.801711,0.087115,0.534451,0.311442,1.019839,0.443463,0.224321,...,0.628425,0.546775,0.970924,0.853544,0.088895,0.259117,1.345176,0.368288,1460716,30173
American Samoa,0.0,0.0,0.0,0.0,0.031474,0.293284,0.208549,0.0,0.029386,0.0,...,0.112834,0.054964,0.277949,0.016183,0.0,0.094357,0.347808,0.078095,275295,4028
Andorra,0.023438,0.136673,0.0,0.118147,0.136444,0.066255,0.246955,0.0,0.308058,0.275911,...,0.431968,0.198045,0.158952,0.0,0.076872,0.02656,0.418471,0.0,280403,8039


In [25]:
num_movies.head()

origin,0,1,2,3,4,5,6,7,8,9,...,23,24,25,26,27,28,29,Unnamed: 0,title,year
Afghanistan,58,58,58,58,58,58,58,58,58,58,...,58,58,58,58,58,58,58,58,58,58
Albania,23,23,23,23,23,23,23,23,23,23,...,23,23,23,23,23,23,23,23,23,23
Algeria,15,15,15,15,15,15,15,15,15,15,...,15,15,15,15,15,15,15,15,15,15
American Samoa,2,2,2,2,2,2,2,2,2,2,...,2,2,2,2,2,2,2,2,2,2
Andorra,4,4,4,4,4,4,4,4,4,4,...,4,4,4,4,4,4,4,4,4,4
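In the summed table above, `Unnamed: 0` and `year` were aggregated along with the topic columns, and the count table repeats the same number in every column. A small sketch of a tidier variant, assuming toy column names `topic_0` and `topic_1` and made-up values, selects the topic columns before aggregating and uses `size()` for the per-origin movie count.

```python
import pandas as pd

# Toy data: three movies from one origin, one from another.
toy = pd.DataFrame({
    'origin': ['USA', 'USA', 'Iceland', 'USA'],
    'topic_0': [0.2, 0.4, 0.1, 0.0],
    'topic_1': [0.8, 0.6, 0.9, 1.0],
    'year': [2010, 2011, 2013, 2015],
})
topic_cols = ['topic_0', 'topic_1']

# Aggregate only the topic columns, so 'year' is not averaged by accident.
topic_mean = toy.groupby('origin')[topic_cols].mean()

# size() gives one number per origin: the number of movies.
movie_count = toy.groupby('origin').size()

print(topic_mean)
print(movie_count)
```

The result of `size()` is a Series indexed by origin, so it can be aligned with the mean table without repeating the count per column.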


In [96]:
#get number of movies as list, then repeat each count 30 times (one row per topic)
n_movies = num_movies[0].tolist()
n_movies = [x for x in n_movies if x != 0]
total_movies = [x for item in n_movies for x in itertools.repeat(item, 30)]

In [55]:
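#reset_index() is left disabled: 'origin' has to stay in the index
#so that iloc[:, 0:30] below selects exactly the 30 topic columns
#topic_score_sum = topic_score_sum.reset_index()
#topic_score_mean = topic_score_mean.reset_index()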

In [93]:
#one list of 30 repeated origin names per origin, in groupby order
origin_list = []
for origin in topic_score_mean.index.tolist():
    origin_list.append([origin]*30)
len(origin_list)

221

In [90]:
topic_list = ['magic, myths', 'school, college', 'fantasy, christmas', 'home', 'ships, sailing, pirates', 'love, relationships', 'war', 'exploration, nature, space',
              'comedy', 'places, nature, scenery', 'hollywood, stars', 'society, culture', 'historical, costumes', 'money, robbery',
              'photography, design', 'spies, terrorism', 'mixed', 'town', 'press, politics', 'crime, police, underworld',
              'documentary, interview', 'friendship, relationships', 'cowboys and indians', 'night life, enjoyment', 'crime, mystery',
              'music', 'farming, country side', 'fantasy, fairy tale', 'love, family', 'gangs, drugs, police']
topics = [topic_list]*len(origins)

In [91]:
df = topic_score_mean.iloc[:, 0:30] #the 30 topic columns only
t = [list(row) for row in df.values] #list of topic scores for each origin
len(t)

221

In [110]:
#make dataframe for visualization on Tableau public: one row per (origin, topic) pair
frames = []
for i in range(len(origins)):
    frames.append(pd.DataFrame({'origin': origin_list[i], 'topic': topics[i], 'topic_score_mean': t[i]}))
origin_df_new = pd.concat(frames)


In [111]:
#add number of movies as a column
origin_df_new['num_movies'] = pd.Series(total_movies, index=origin_df_new.index)
origin_df_new.head()

Unnamed: 0,origin,topic,topic_score_mean,num_movies
0,Afghanistan,"magic, myths",0.013853,58
1,Afghanistan,"school, college",0.033721,58
2,Afghanistan,"fantasy, christmas",0.022555,58
3,Afghanistan,home,0.028684,58
4,Afghanistan,"ships, sailing, pirates",0.016676,58
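The long table above (one row per origin and topic) was assembled from repeated lists. As a hedged alternative, the same shape can be produced with `melt`. The sketch below uses a toy wide table with three topics and two placeholder origins `A` and `B`; the scores are made up, and only the topic labels and the counts 58 and 23 echo values that appear in this notebook.

```python
import pandas as pd

topic_labels = ['magic, myths', 'school, college', 'fantasy, christmas']

# Toy wide table: one row per origin, one column per topic id.
wide = pd.DataFrame(
    [[0.1, 0.7, 0.2],
     [0.5, 0.25, 0.25]],
    index=pd.Index(['A', 'B'], name='origin'),
    columns=[0, 1, 2],
)
counts = pd.Series([58, 23], index=wide.index, name='num_movies')

long_df = (
    wide.rename(columns=dict(enumerate(topic_labels)))  # topic id -> label
        .reset_index()                                   # 'origin' becomes a column
        .melt(id_vars='origin', var_name='topic', value_name='topic_score_mean')
)
# Attach the movie count of each origin to every one of its rows.
long_df['num_movies'] = long_df['origin'].map(counts)

print(long_df)
```

Because the counts are looked up by origin name, this version does not depend on the lists being in the same order.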


In [112]:
#Tableau Public sums num_movies over all rows of an origin, and each origin has 30 rows (one per topic), so divide num_movies by 30 to make that sum equal the true count.
origin_df_new['num_movies/30'] = origin_df_new['num_movies']/30
origin_df_new.head()

Unnamed: 0,origin,topic,topic_score_mean,num_movies,num_movies/30
0,Afghanistan,"magic, myths",0.013853,58,1.933333
1,Afghanistan,"school, college",0.033721,58,1.933333
2,Afghanistan,"fantasy, christmas",0.022555,58,1.933333
3,Afghanistan,home,0.028684,58,1.933333
4,Afghanistan,"ships, sailing, pirates",0.016676,58,1.933333
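The arithmetic behind the division can be checked quickly. This small sketch uses the Afghanistan count of 58 from the table above and assumes, as the comment in the cell says, that the plotting tool sums the column over the 30 topic rows of an origin.

```python
NUM_TOPICS = 30
num_movies = 58  # Afghanistan, from the table above

# Without the division, summing the 30 rows overstates the count 30-fold.
naive_sum = sum([num_movies] * NUM_TOPICS)

# With the division, the 30 rows add back up to the true count.
share_per_row = num_movies / float(NUM_TOPICS)
summed_by_tool = sum([share_per_row] * NUM_TOPICS)

print(naive_sum)
print(round(share_per_row, 6))
print(round(summed_by_tool, 6))
```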


In [113]:
#order the dataframe by total number of movies
origin_df_new = origin_df_new.sort_values('num_movies', ascending=False)
origin_df_new.head()

Unnamed: 0,origin,topic,topic_score_mean,num_movies,num_movies/30
28,USA,"love, family",0.083554,47276,1575.866667
13,USA,"money, robbery",0.038544,47276,1575.866667
27,USA,"fantasy, fairy tale",0.027246,47276,1575.866667
0,USA,"magic, myths",0.00929,47276,1575.866667
1,USA,"school, college",0.03072,47276,1575.866667


In [117]:
#save above dataframe to csv
origin_df_new.to_csv(dropbox + "origin_df_all_new.csv", index=False)

In [100]:
#check the number of movies produced in each origin
num_df = pd.DataFrame({'origin': num_movies.index.tolist(), 'total_num': n_movies})
num_df.head()

Unnamed: 0,origin,total_num
0,Afghanistan,58
1,Albania,23
2,Algeria,15
3,American Samoa,2
4,Andorra,4


In [115]:
#keep only origins with at least 100 movies
major_origin = origin_df_new[origin_df_new.num_movies >= 100]
major_origin.head()

Unnamed: 0,origin,topic,topic_score_mean,num_movies,num_movies/30
28,USA,"love, family",0.083554,47276,1575.866667
13,USA,"money, robbery",0.038544,47276,1575.866667
27,USA,"fantasy, fairy tale",0.027246,47276,1575.866667
0,USA,"magic, myths",0.00929,47276,1575.866667
1,USA,"school, college",0.03072,47276,1575.866667


In [118]:
major_origin.to_csv(dropbox + "origin_df_over_100.csv", index=False)