# ***Exercise 1: The Movie Database analysis***
The file `data/tmdb_movies.json` contains information about a large number of movies extracted from The Movie Database (https://www.themoviedb.org/). The JSON file is a list of movies, each with the following fields:

```
    {
        budget: Cost (in $),
        genres: A list of movie genres,
        homepage: Movie homepage URL,
        id: Movie ID in this database,
        keywords: Keywords associated with the movie,
        original_language: Language code representing the language in which the movie was created,
        original_title: Original movie title,
        overview: Brief description about the movie,
        popularity: Popularity score,
        production_companies: List of companies that produced the movie,
        production_countries: List of countries in which the movie was produced,
        release_date: Movie release date,
        revenue: Revenue (in $),
        runtime: Movie length (in minutes),
        spoken_languages: Languages spoken in the movie,
        status: Release status,
        tagline: Tagline,
        title: Title,
        vote_average: Average voting score,
        vote_count: Number of votes
    }
```

Answer the following questions:
1. Import the JSON file and count the number of movies
2. Add a new key to the dict called `year`, and store the release year
3. Write a function that prints the following information of each movie `{year}: {original_title}`
```python
    def print_movies(movies: list)
```
4. Write a function that, given a year, returns the movies released in that year or later. How many movies have been released after 2015?
```python
    def movies_by_year(movies: list, year: int) -> list
```

5. Write a function that, given a list of keywords, returns all movies tagged with all of those keywords
```python
    def movies_by_keywords(movies: list, keywords: set) -> list
```

6. In how many movies did `Twentieth Century Fox Film Corporation` participate as a producer?
7. Get the average rating of all `Fantasy` movies released in 2015 or later
8. What is the most expensive `space travel` movie of all time?
9. In which country have the most `romance` movies been produced?

In [38]:
import json

#### 1. Import the JSON file and count the number of movies

In [54]:
# import the json file
with open("data/tmdb_movies.json") as file:
    movies = json.load(file)

In [57]:
print(f"This file has {len(movies)} movies")

This file has 4800 movies


In [58]:
movies[0]

{'budget': 237000000,
 'genres': [{'id': 28, 'name': 'Action'},
  {'id': 12, 'name': 'Adventure'},
  {'id': 14, 'name': 'Fantasy'},
  {'id': 878, 'name': 'Science Fiction'}],
 'homepage': 'http://www.avatarmovie.com/',
 'id': 19995,
 'keywords': [{'id': 1463, 'name': 'culture clash'},
  {'id': 2964, 'name': 'future'},
  {'id': 3386, 'name': 'space war'},
  {'id': 3388, 'name': 'space colony'},
  {'id': 3679, 'name': 'society'},
  {'id': 3801, 'name': 'space travel'},
  {'id': 9685, 'name': 'futuristic'},
  {'id': 9840, 'name': 'romance'},
  {'id': 9882, 'name': 'space'},
  {'id': 9951, 'name': 'alien'},
  {'id': 10148, 'name': 'tribe'},
  {'id': 10158, 'name': 'alien planet'},
  {'id': 10987, 'name': 'cgi'},
  {'id': 11399, 'name': 'marine'},
  {'id': 13065, 'name': 'soldier'},
  {'id': 14643, 'name': 'battle'},
  {'id': 14720, 'name': 'love affair'},
  {'id': 165431, 'name': 'anti war'},
  {'id': 193554, 'name': 'power relations'},
  {'id': 206690, 'name': 'mind and soul'},
  {'id': 2

#### 2. Add a new key to the dict called year, and store the release year

In [59]:
# option 1 - pick the first 4 characters of the "release_date" string

for movie in movies:
    year = movie["release_date"][:4]
    movie["year"] = year

In [83]:
# option 2 - use the datetime module to parse the date and extract the year

import datetime

for movie in movies:
    date = datetime.datetime.strptime(movie["release_date"], "%Y-%m-%dT%H:%M:%S.%fZ")
    movie["year"] = date.year
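
Both options assume that every movie has a well-formed `release_date`. The following is a minimal sketch of a more defensive variant, assuming that some entries might have an empty or missing date and that valid dates start with `YYYY-MM-DD`. The helper name `extract_year` and the sample records are hypothetical and not part of the dataset.

```python
import datetime


def extract_year(release_date):
    """Return the release year as an int, or None if the date is missing or malformed."""
    if not release_date:
        return None
    try:
        # Only the first 10 characters (YYYY-MM-DD) are parsed; any time part is ignored.
        return datetime.datetime.strptime(release_date[:10], "%Y-%m-%d").year
    except ValueError:
        return None


# Hypothetical sample records, not taken from the dataset
sample = [
    {"original_title": "A", "release_date": "2009-12-10T00:00:00.000Z"},
    {"original_title": "B", "release_date": ""},
    {"original_title": "C", "release_date": None},
]

for movie in sample:
    movie["year"] = extract_year(movie["release_date"])

print([m["year"] for m in sample])  # [2009, None, None]
```

With this variant, any later code that calls `int(movie["year"])` would need to skip movies whose year is `None`.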

#### 3. Write a function that prints the following information of each movie `{year}: {original_title}`

In [8]:
def print_movies(movies):
    for movie in movies:
        print(f"{movie['year']}: {movie['original_title']}")

In [9]:
print_movies(movies[:10])

2009: Avatar
2007: Pirates of the Caribbean: At World's End
2015: Spectre
2012: The Dark Knight Rises
2012: John Carter
2007: Spider-Man 3
2010: Tangled
2015: Avengers: Age of Ultron
2009: Harry Potter and the Half-Blood Prince
2016: Batman v Superman: Dawn of Justice


#### 4. Write a function that, given a year, returns the movies released in that year or later. How many movies have been released after 2015?

In [10]:
def movies_by_year(movies, year):
    year_movies = []
    for movie in movies:
        # compare as integers: "year" may have been stored as a string (option 1 above)
        if int(movie["year"]) >= year:
            year_movies.append(movie)
    
    return year_movies

In [11]:
# strictly after 2015, i.e. from 2016 onwards
movies_after_2015 = movies_by_year(movies, 2016)

In [12]:
print_movies(movies_after_2015[:10])

2016: Batman v Superman: Dawn of Justice
2016: Captain America: Civil War
2016: Star Trek Beyond
2016: The Legend of Tarzan
2016: X-Men: Apocalypse
2016: Suicide Squad
2016: The Jungle Book
2016: Independence Day: Resurgence
2016: シン・ゴジラ
2016: Alice Through the Looking Glass


In [77]:
# alternative implementation using the built-in "filter" function (here 2015 itself is included)

year = 2015
movies_after_2015 = filter(lambda x: int(x["year"]) >= year, movies)

In [78]:
print_movies(list(movies_after_2015)[:10])

2015: Spectre
2015: Avengers: Age of Ultron
2016: Batman v Superman: Dawn of Justice
2016: Captain America: Civil War
2015: Jurassic World
2015: Furious 7
2015: The Good Dinosaur
2016: Star Trek Beyond
2015: Jupiter Ascending
2016: The Legend of Tarzan
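
Note that `filter` returns a lazy iterator that can be consumed only once, which is why the cell above wraps it in `list(...)`. To both count and print the result, it helps to convert it to a list first and keep that list. A small sketch with hypothetical sample records:

```python
# Hypothetical sample records with the "year" key already added
sample = [
    {"original_title": "A", "year": 2014},
    {"original_title": "B", "year": 2015},
    {"original_title": "C", "year": 2016},
]

year = 2015
recent = filter(lambda m: int(m["year"]) >= year, sample)

recent_list = list(recent)  # consumes the iterator
leftover = list(recent)     # the iterator is now exhausted

print(len(recent_list))  # 2
print(len(leftover))     # 0
```

The number of movies asked for in the question is then simply the `len` of the stored list.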


#### 5. Write a function that, given a list of keywords, returns all movies tagged with all of those keywords

In [13]:
def movies_by_keywords(movies, keywords):
    keywords = set(keywords)
    kw_movies = []
    for movie in movies:
        kws = [m["name"] for m in movie["keywords"]]
        kws = set(kws)
        
        if keywords.issubset(kws):
            kw_movies.append(movie)
        
    return kw_movies

In [14]:
# other implementation, without the list comprehension

def movies_by_keywords(movies, keywords):
    keywords = set(keywords)  # accept any iterable; a plain list has no issubset method
    kw_movies = []
    for movie in movies:
        kws = []
        for m in movie["keywords"]:
            kws.append(m["name"])
        
        if keywords.issubset(kws):
            kw_movies.append(movie)
            
    return kw_movies

In [15]:
# other implementation, with a list that stores if the kw is present or not

def movies_by_keywords(movies, keywords):
    kw_movies = []
    for movie in movies:
        kws = [m["name"] for m in movie["keywords"]]
        
        kw_included = []
        for k in keywords:
            if k in kws:
                kw_included.append(True)
            else:
                kw_included.append(False)
                
        if all(kw_included): #this means all keywords are included
            kw_movies.append(movie)
            
    return kw_movies

In [51]:
# other implementation, with a counter

def movies_by_keywords(movies, keywords):
    kw_movies = []
    for movie in movies:
        kws = [m["name"] for m in movie["keywords"]]
        
        counter = 0
        for k in keywords:
            if k in kws:
                counter += 1
                
        if counter == len(keywords): #this means all keywords are included
            kw_movies.append(movie)
            
    return kw_movies

In [65]:
# other implementation, with filters

def movies_by_keywords(movies, keywords):
    kw_movies = list(filter(lambda x: set(keywords).issubset([kw["name"] for kw in x["keywords"]]), movies))
    return kw_movies

In [52]:
kw_movies = movies_by_keywords(movies, ["space war"])

In [18]:
print_movies(kw_movies)

2009: Avatar
1984: The Ice Pirates
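
All the implementations above keep a movie only when it contains every requested keyword. As a sketch of how the same set logic could also support matching any of the keywords, the variant below adds a `match_all` parameter. The function name, the parameter, and the sample records are hypothetical and not part of the exercise.

```python
def movies_by_keywords_any_or_all(movies, keywords, match_all=True):
    """Return movies tagged with all (default) or at least one of the keywords."""
    wanted = set(keywords)
    result = []
    for movie in movies:
        names = {kw["name"] for kw in movie["keywords"]}
        if match_all:
            matched = wanted <= names  # subset test: every wanted keyword is present
        else:
            matched = not wanted.isdisjoint(names)  # at least one keyword in common
        if matched:
            result.append(movie)
    return result


# Hypothetical sample records using the same structure as the dataset
sample = [
    {"original_title": "A", "keywords": [{"id": 1, "name": "space war"}, {"id": 2, "name": "alien"}]},
    {"original_title": "B", "keywords": [{"id": 2, "name": "alien"}]},
    {"original_title": "C", "keywords": []},
]

both = movies_by_keywords_any_or_all(sample, ["space war", "alien"])
either = movies_by_keywords_any_or_all(sample, ["space war", "alien"], match_all=False)

print([m["original_title"] for m in both])    # ['A']
print([m["original_title"] for m in either])  # ['A', 'B']
```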


#### 6. In how many movies did `Twentieth Century Fox Film Corporation` participate as a producer?

In [19]:
def movies_by_production_company(movies, company):
    production_movies = []
    for movie in movies:
        production_companies = set([m["name"] for m in movie["production_companies"]])
        
        if company in production_companies:
            production_movies.append(movie)
            
    return production_movies

In [20]:
tcf_movies = movies_by_production_company(movies, "Twentieth Century Fox Film Corporation")

In [22]:
print_movies(tcf_movies[:10])

2009: Avatar
1997: Titanic
2006: X-Men: The Last Stand
2014: X-Men: Days of Future Past
2016: X-Men: Apocalypse
2016: Independence Day: Resurgence
2011: X-Men: First Class
2009: Night at the Museum: Battle of the Smithsonian
2009: X-Men Origins: Wolverine
2016: Kung Fu Panda 3
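
The question asks for a number, which is `len(tcf_movies)`. If only the count is needed, the intermediate list can be skipped. A minimal sketch, where the helper name `count_movies_by_company` and the sample records are hypothetical:

```python
def count_movies_by_company(movies, company):
    """Count the movies in which the given company appears as a producer."""
    return sum(
        1
        for movie in movies
        if any(c["name"] == company for c in movie["production_companies"])
    )


# Hypothetical sample records using the same structure as the dataset
sample = [
    {"original_title": "A", "production_companies": [{"name": "Studio X", "id": 1}]},
    {"original_title": "B", "production_companies": [{"name": "Studio Y", "id": 2}]},
    {"original_title": "C", "production_companies": [{"name": "Studio X", "id": 1},
                                                     {"name": "Studio Y", "id": 2}]},
]

print(count_movies_by_company(sample, "Studio X"))  # 2
```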


#### 7. Get the average rating of all `Fantasy` movies released in 2015 or later

In [23]:
# first, let's implement a function to filter movies by genre

def movies_by_genre(movies, genre):
    genre_movies = []
    for movie in movies:
        genres = set([m["name"] for m in movie["genres"]])
        
        if genre in genres:
            genre_movies.append(movie)
            
    return genre_movies

In [24]:
# sequential calling
movies_2015 = movies_by_year(movies, 2015)
my_movies = movies_by_genre(movies_2015, "Fantasy")

In [25]:
# the same, but in just one line
my_movies = movies_by_year(movies_by_genre(movies, "Fantasy"), 2015)

In [26]:
# extract all the ratings
ratings = [m["vote_average"] for m in my_movies]

In [27]:
# calculate the average
avg_rating = sum(ratings)/len(ratings)
print(f"The average rating of Fantasy movies released after 2015 is {round(avg_rating,2)}")

The average rating of Fantasy movies released after 2015 is 6.1
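
The division above raises a `ZeroDivisionError` if no movie matches the filters. A small sketch of a guarded version using `statistics.mean` from the standard library; the helper name `average_rating` and the sample records are hypothetical:

```python
from statistics import mean


def average_rating(movies):
    """Return the mean of "vote_average", or None when the list is empty."""
    ratings = [m["vote_average"] for m in movies]
    return mean(ratings) if ratings else None


# Hypothetical sample records
sample = [{"vote_average": 6.0}, {"vote_average": 7.0}]

print(average_rating(sample))  # 6.5
print(average_rating([]))      # None
```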


#### 8. What is the most expensive `space travel` movie of all time?

In [29]:
st_movies = movies_by_keywords(movies, ["space travel"])

In [30]:
most_expensive = sorted(st_movies, key=lambda x: x["budget"], reverse=True)[0]
print(f"The most expensive space travel movie is {most_expensive['original_title']}")

The most expensive space travel movie is John Carter


In [31]:
# the same, but with the max function

most_expensive = max(st_movies, key=lambda x: x["budget"])
print(f"The most expensive space travel movie is {most_expensive['original_title']}")

The most expensive space travel movie is John Carter


#### 9. In which country have the most `romance` movies been produced?

In [87]:
romance_movies = movies_by_keywords(movies, ["romance"])

In [88]:
countries = []
for movie in romance_movies:
    production_countries = [m["name"] for m in movie["production_countries"]]
    countries.extend(production_countries)

In [34]:
unique_countries = set(countries)

In [35]:
for country in unique_countries:
    count = countries.count(country)
    print(f"{country}: {count}")

South Africa: 1
Italy: 2
Czech Republic: 1
United States of America: 18
United Kingdom: 4
Spain: 1
Germany: 2
France: 3
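
In the output above, the United States of America has the highest count (18). The counting and the selection of the top country can also be done in one step with `collections.Counter`. A minimal sketch, using a short hypothetical `countries` list in place of the one built from the dataset:

```python
from collections import Counter

# Hypothetical stand-in for the "countries" list built above
countries = ["United States of America", "France", "United States of America", "Italy"]

counts = Counter(countries)
top_country, top_count = counts.most_common(1)[0]

print(f"{top_country}: {top_count}")  # United States of America: 2
```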


# ***Exercise 2: Twitter data and the OS***

We will work with Twitter data again

 1. Using the `os` module, create a folder called `TWEETS` in the root directory
 2. For each tweet in `twitter_data.json` create a subfolder inside `TWEETS` with the tweet creation date as the folder name
 3. Save each tweet (in JSON format) in the folder corresponding to its creation date. Use the tweet ID as the file name.

_NOTE: You cannot use the file system explorer to create folders, you have to use the `os` module instead_

In [4]:
import os 
import json

In [2]:
# 1. Using the os module, create a folder called TWEETS in the root directory
os.mkdir("TWEETS")

In [5]:
# 2. For each tweet in twitter_data.json create a subfolder inside TWEETS with the tweet creation date as the folder name
# 3. Save all tweets (in json format) in its corresponding folder according to the creation date. Use the tweet ID as the file name.

# load twitter data
with open("data/twitter_data.json") as file:
    twitter = json.load(file)
    
for tw in twitter:
    creation_date = tw["created_at"][:10]
    
    # create the folders
    path = os.path.join("TWEETS",creation_date)
    os.makedirs(path, exist_ok=True)
    
    # save the files
    name = tw["id"] # get the id (this will be the file name)
    filepath = os.path.join(path,f"{name}.json")
    with open(filepath, "w") as file:
        json.dump(tw, file)
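
Note that `os.mkdir("TWEETS")` raises `FileExistsError` when the cell is run a second time, while `os.makedirs(..., exist_ok=True)` does not. The sketch below shows the whole flow in a re-runnable form. It writes into a temporary directory and uses hypothetical tweets whose `created_at` is assumed to start with `YYYY-MM-DD`; the real file may use a different date format.

```python
import json
import os
import tempfile

# Hypothetical tweets; the "created_at" format is an assumption
tweets = [
    {"id": 1, "created_at": "2020-01-01 10:00:00", "text": "first"},
    {"id": 2, "created_at": "2020-01-01 12:30:00", "text": "second"},
    {"id": 3, "created_at": "2020-01-02 09:15:00", "text": "third"},
]

# Use a temporary directory so the sketch does not touch the working directory
root = os.path.join(tempfile.mkdtemp(), "TWEETS")
os.makedirs(root, exist_ok=True)  # unlike os.mkdir, does not fail if the folder exists

for tw in tweets:
    # one subfolder per creation date
    day_folder = os.path.join(root, tw["created_at"][:10])
    os.makedirs(day_folder, exist_ok=True)

    # one JSON file per tweet, named after the tweet ID
    with open(os.path.join(day_folder, f"{tw['id']}.json"), "w") as file:
        json.dump(tw, file)

for folder in sorted(os.listdir(root)):
    print(folder, sorted(os.listdir(os.path.join(root, folder))))
```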