## Promoting Tourism in San Francisco
<p>San Francisco has been home to many famous films, including the action classic “Bullitt” and the recent science-fiction epic “Rise of the Planet of the Apes”. To celebrate the cinematic history of the city, the tourism board has asked you to perform some analyses.</p>
<p>Their idea is to promote the 10 most popular filming locations in San Francisco. The board plans to create an attraction at each of the 10 locations based on the biggest film (by worldwide income) shot there.</p>
<p>At your disposal are two datasets. One lists every San Francisco filming location together with the film shot there. The other contains movie details drawn from the Internet Movie Database (IMDb).</p>
<div style="background-color: #efebe4; color: #05192d; text-align:left; vertical-align: middle; padding: 15px 25px 15px 25px; line-height: 1.6;">
    <div style="font-size:16px"><b>datasets/locations.csv - Filming locations of movies shot in San Francisco since 1924</b>
    </div>
    <div> Source: <a href="https://data.sfgov.org/Culture-and-Recreation/Film-Locations-in-San-Francisco/yitu-d5am">Film Locations in San Francisco</a></div>

<ul>
    <li><b>Title: </b>Title of the movie. Note that some films may share the same title, and are only differentiated by year of release.</li>
    <li><b>Release Year: </b>Year of release in cinemas.</li>
    <li><b>Locations: </b>Name of location in San Francisco where a scene was shot for the movie.</li>
    <li><b>Production Company: </b>Company that produced the film.</li>
    <li><b>Distributor: </b>Company that distributed the film.</li>
</ul>
    </div>
<div style="background-color: #efebe4; color: #05192d; text-align:left; vertical-align: middle; padding: 15px 25px 15px 25px; line-height: 1.6; margin-top: 17px;">
    <div style="font-size:16px"><b>datasets/imdb_movies.csv - Data on over 85,000 movies up to 2020</b>
    </div>
    <div>Source: <a href="https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset">Kaggle (IMDb movies extensive dataset)</a></div>
<ul>
    <li><b>imdb_title_id: </b>Unique film id.</li>
    <li><b>title: </b>Title of the film. Note that some films may share the same title, and are only differentiated by year of release.</li>
    <li><b>year: </b>The year of release.</li> 
    <li><b>genre: </b>The genres of the film. The primary genre of the film is the first genre listed.</li>
    <li><b>duration: </b>The duration of the film in minutes.</li>
    <li><b>director: </b>The name of the director.</li>
    <li><b>actors: </b>The leading actors of the film.</li>
    <li><b>avg_vote: </b>Average review given to the film.</li>
    <li><b>worldwide_gross_income: </b>Total income for the film worldwide in US dollars.</li>
</ul>
    </div>
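
Because titles are not unique in either dataset, joining on the title alone can pair a location with the wrong film. The sketch below illustrates this with two tiny made-up frames (the locations, directors, and years are placeholders, not values from the real files); it only assumes the column names listed above.

```python
import pandas as pd

# Made-up rows: two different films that share a title.
locations = pd.DataFrame({
    "Title": ["Ava", "Ava"],
    "Release Year": [2017, 2020],
    "Locations": ["Location 1", "Location 2"],
})
movies = pd.DataFrame({
    "title": ["Ava", "Ava"],
    "year": [2017, 2020],
    "director": ["Director X", "Director Y"],
})

# Joining on title only pairs every location with every film of that name.
by_title = locations.merge(movies, left_on="Title", right_on="title")

# Joining on title and year keeps one film per location.
by_both = locations.merge(
    movies, left_on=["Title", "Release Year"], right_on=["title", "year"]
)

print(len(by_title), len(by_both))
```

In this toy case the title-only join returns four rows, while the join on both keys returns the two correct pairings.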

In [135]:
# Use this cell to begin your analysis, and add as many as you would like!
import pandas as pd
import numpy as np

In [136]:
# Filming locations: 1,743 rows
loc_df = pd.read_csv('datasets/locations.csv')

# IMDb movie details: 85,854 rows
imdb_df = pd.read_csv('datasets/imdb_movies.csv')


In [137]:
# Sanity check: count the rows that are not exact duplicates.
# Both counts equal the number of rows loaded, so neither file contains duplicate rows.
#(~loc_df.duplicated()).sum()
#(~imdb_df.duplicated()).sum()

# For each title, count the distinct release years it appears with.
#imdb_df['nos'] = imdb_df.groupby("title")['year'].transform("nunique")
#print(imdb_df.nos.value_counts())
# e.g. one title has 10 entries spread over 9 different release years
#chk = imdb_df.loc[imdb_df['nos'] == 8]
#print(chk.head(20))
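
The commented-out checks above can be tried on a small frame first. This is a minimal sketch with made-up rows (the years are illustrative only); it counts exact duplicate rows and lists the titles that appear with more than one release year.

```python
import pandas as pd

# Made-up stand-in for imdb_movies.csv.
toy = pd.DataFrame({
    "title": ["Ava", "Ava", "Darling", "Bullitt"],
    "year": [2017, 2020, 1965, 1968],
})

# Number of rows that exactly repeat an earlier row.
n_dupes = int(toy.duplicated().sum())

# Number of distinct release years per title.
years_per_title = toy.groupby("title")["year"].nunique()

# Titles that can only be identified together with their year.
ambiguous = sorted(years_per_title[years_per_title > 1].index)

print(n_dupes, ambiguous)
```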

In [138]:
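# Count how many rows (filmed scenes) each location has.
loc_rank = loc_df['Locations'].value_counts().rename_axis('Locations').reset_index(name='count')
# Dense rank: 1 = most-filmed location; locations with equal counts share a rank.
loc_rank['loc_rank'] = loc_rank['count'].rank(ascending=False, method='dense')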
#print(loc_rank)

loc_df=pd.merge(loc_df,loc_rank,how='left',on=["Locations"])
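
One property of `method='dense'` is worth keeping in mind: tied locations share a rank, so filtering on "rank at most N" can return more than N locations. A small sketch with placeholder location names (A to D are hypothetical):

```python
import pandas as pd

# Placeholder locations: A appears 3 times, B and C twice each, D once.
toy = pd.DataFrame({"Locations": ["A", "A", "A", "B", "B", "C", "C", "D"]})

counts = (
    toy["Locations"].value_counts()
    .rename_axis("Locations")
    .reset_index(name="count")
)
counts["loc_rank"] = counts["count"].rank(ascending=False, method="dense")

# Map each location to its dense rank.
ranks = dict(zip(counts["Locations"], counts["loc_rank"]))

# B and C tie for rank 2, so "rank <= 2" selects three locations.
top2 = counts[counts["loc_rank"] <= 2]

print(ranks, len(top2))
```

If ties occur near the tenth rank in the real data, a separate tie-breaking rule may be needed to end up with exactly ten locations.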

In [139]:
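# Align the join keys: both frames get 'Title' and 'Year' columns.
loc_df.rename(columns={'Release Year':'Year'},inplace=True)
imdb_df.rename(columns={'title':'Title','year':'Year'},inplace=True)

# Left join keeps every IMDb row; films shot at several locations appear once per location.
# The merged frame has 86,657 rows.
all_df=pd.merge(imdb_df,loc_df,how='left',on=["Title","Year"],indicator=True)
#all_df.shape

#print(all_df.loc[all_df['Title'].isin(['Darling','Ava'])])

# worldwide_gross_income is stored as '<currency> <amount>', e.g. '$ 12345'.
all_df['currency']=all_df['worldwide_gross_income'].str.split().str[0]
all_df['income']=all_df['worldwide_gross_income'].str.split().str[1].astype(float)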

# print(all_df.currency.value_counts())
# $      31614
# INR       58
# NPR        1
# PKR        1
# GBP        1
#chk=all_df.loc[all_df['currency'].isin(['INR','NPR','PKR','GBP'])]
# None of the non-USD rows has a San Francisco location, so no currency conversion is needed.
#print(chk.Locations.value_counts())

# The primary genre is the first one listed.
all_df['primary']=all_df['genre'].str.split(',').str[0]
# Flag films whose primary genre is Action, Drama, or Biography.
all_df['flagprimary'] = all_df['primary'].isin(['Action', 'Drama', 'Biography']).astype(int)

# Flag films with an average review above 6 (missing reviews are flagged 0).
all_df['more6'] = (all_df['avg_vote'] > 6).astype(int)

#print(all_df.tail())
#all_df.info()

#print(imdb_df.loc[imdb_df['Title'].isin(['Darling','Ava'])])
#print()
#print()
#print(all_df.loc[all_df['Title'].isin(['Darling','Ava'])])
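
The income parsing in this cell relies on every non-USD row being irrelevant to San Francisco. A slightly more defensive variant, sketched here on made-up strings, splits the text once, converts the amount with `pd.to_numeric`, and blanks out amounts that are not in US dollars. Masking non-USD values is an assumption of this sketch, not something the original analysis does.

```python
import pandas as pd

# Made-up examples in the '<currency> <amount>' layout, plus a missing value.
raw = pd.Series(["$ 42300873", "INR 5000000", None])

# Split once into currency and amount columns.
parts = raw.str.split(n=1, expand=True)
currency = parts[0]
income = pd.to_numeric(parts[1], errors="coerce")

# Keep only amounts reported in US dollars; everything else becomes NaN.
usd_income = income.where(currency.eq("$"))

print(usd_income.tolist())
```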

In [140]:
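# Order rows by location rank, then by income (highest first) within each rank.
all_df.sort_values(['loc_rank', 'income'], ascending=[True, False], inplace=True)
#print(all_df.head(15))

# Keep only movies with an average review higher than 6
# whose primary genre is Action, Drama, or Biography.
all_df.loc[(all_df['flagprimary'] == 1) & (all_df['more6'] == 1), 'subset'] = 1
subset_df=all_df.loc[all_df['subset'] == 1]
#print(subset_df.head(15))

# Rows are sorted by income within each rank, so first() picks the top-grossing film.
# Rows without a location have no rank and are dropped by the groupby.
# Note: first() returns the first non-null value of each column, not a whole row.
final_df=subset_df.groupby('loc_rank',as_index=False).first()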
#print(final_df.head(15))
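
Because `GroupBy.first()` skips missing values column by column, it can combine values from different rows when the top row has gaps. The sketch below, on made-up data, contrasts it with `drop_duplicates`, which keeps the first row of each group intact. The three columns selected later (location, title, year) are unlikely to be missing here, so this is a precaution rather than a known defect.

```python
import pandas as pd

# Made-up films; the top-grossing film at rank 1 has no director recorded.
toy = pd.DataFrame({
    "loc_rank": [1, 1, 2],
    "Title": ["Film A", "Film B", "Film C"],
    "director": [None, "Dir B", "Dir C"],
    "income": [900.0, 500.0, 300.0],
})
toy = toy.sort_values(["loc_rank", "income"], ascending=[True, False])

# first(): fills the gap with the director of the next film in the group.
mixed = toy.groupby("loc_rank", as_index=False).first()

# drop_duplicates(): keeps the whole top row, gap included.
whole = toy.drop_duplicates("loc_rank", keep="first")

print(mixed.loc[0, "director"], whole.iloc[0]["director"])
```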

In [141]:
# Describe each rank in words; ranks beyond 10 are labelled 'Others'.
choices = ['Most popular location', 'Second most popular location', 'Third most popular location',
           'Fourth most popular location', 'Fifth most popular location', 'Sixth most popular location',
           'Seventh most popular location', 'Eighth most popular location', 'Ninth most popular location',
           'Tenth most popular location']
labels = dict(enumerate(choices, start=1))
final_df['Popularity'] = final_df['loc_rank'].astype(int).map(labels).fillna('Others')
#print(final_df.head(15))

In [142]:
final_df.rename(columns={'Locations':'Location'},inplace=True)
# Keep the top-grossing film for each of the 10 highest ranks.
sf_hits=final_df[['Location', 'Title','Year']].head(10)
sf_hits.shape


(10, 3)