Handling big data and Data Scraping:
The first step in processing data is getting that data. The best case is that you are
given the data. Quite often, though, the data you need is not provided, or even if it is, it's not in the
format you want.
In this module, you will learn:
● How to scrape data from external sources
● How to parse and transform it into a format you need

**Tasks**

**1.**
Download the IMDb Movie Ratings
source: https://www.imdb.com/interfaces/
1. Find the 20 most popular movies with a rating above 8.0
2. Find the 20 best-rated movies with over 40,000 votes in the 2000s (year >= 2000)

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
pip install pdfcrowd

Collecting pdfcrowd
  Downloading https://files.pythonhosted.org/packages/30/36/00c7cb012f47bdc351471836b82899cfe1d5b559f47b2ba752af94559353/pdfcrowd-5.1.3-py2.py3-none-any.whl
Installing collected packages: pdfcrowd
Successfully installed pdfcrowd-5.1.3


In [3]:
# In this scenario the data has already been provided, so we only need to import it.
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import io
import base64
import pdfcrowd

In [9]:
# Read the IMDb files. They are tab-separated, hence sep="\t".
names = pd.read_csv("/content/drive/MyDrive/DADV/datasets/name.basics.tsv/data.tsv",sep="\t")
movie = pd.read_csv("/content/drive/MyDrive/DADV/datasets/title.basics.tsv/data.tsv",sep="\t")
ratings = pd.read_csv("/content/drive/MyDrive/DADV/datasets/title.ratings.tsv/data.tsv",sep="\t")

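
The IMDb files mark missing values with a literal `\N` (visible in the `endYear` column further down), which is one reason numeric columns can load as `object`. A minimal sketch of a stricter read follows. It uses a small in-memory sample instead of the real files, and the third row is made up for illustration; `na_values` and `dtype` are standard `pandas.read_csv` parameters.

```python
import io
import pandas as pd

# Tiny stand-in for title.basics.tsv (the last row is hypothetical).
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\truntimeMinutes\n"
    "tt0000001\tshort\tCarmencita\t1894\t1\n"
    "tt0000002\tshort\tLe clown et ses chiens\t1892\t5\n"
    "tt9999999\tmovie\tHypothetical Title\t\\N\t\\N\n"
)

sample = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    na_values="\\N",  # treat IMDb's \N marker as a missing value
    dtype={"tconst": str, "titleType": str, "primaryTitle": str},
)

# startYear and runtimeMinutes are now numeric, with NaN where \N appeared.
print(sample.dtypes)
```

With the missing-value marker handled at read time, a later `pd.to_numeric` conversion of `startYear` would likely be unnecessary.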


In [8]:
ratings.head()

      tconst  averageRating  numVotes
0  tt0000001            5.7      1702
1  tt0000002            6.1       210
2  tt0000003            6.5      1458
3  tt0000004            6.2       123
4  tt0000005            6.2      2260


In [10]:
movie.head()

      tconst  titleType            primaryTitle           originalTitle  isAdult  startYear  endYear  runtimeMinutes                    genres
0  tt0000001      short              Carmencita              Carmencita        0       1894       \N               1         Documentary,Short
1  tt0000002      short  Le clown et ses chiens  Le clown et ses chiens        0       1892       \N               5           Animation,Short
2  tt0000003      short          Pauvre Pierrot          Pauvre Pierrot        0       1892       \N               4  Animation,Comedy,Romance
3  tt0000004      short             Un bon bock             Un bon bock        0       1892       \N              12           Animation,Short
4  tt0000005      short        Blacksmith Scene        Blacksmith Scene        0       1893       \N               1              Comedy,Short


In [11]:
#now selecting average rating > 8 from ratings dataframe
rating8 = ratings[ratings['averageRating']> 8.0]
print(rating8)

            tconst  averageRating  numVotes
275      tt0000361            8.1        10
300      tt0000417            8.2     45079
611      tt0000961            8.2        13
691      tt0001089            8.3        19
748      tt0001233            8.6         5
...            ...            ...       ...
1151272  tt9916380            9.0       104
1151275  tt9916460            9.3        15
1151277  tt9916538            8.3         6
1151280  tt9916578            8.1        25
1151282  tt9916628            8.6         5

[226776 rows x 3 columns]


In [12]:
# The ratings table has no titles, so it has to be merged with the title table later.
# "Most popular" is interpreted here as having the most votes (numVotes).
sorted_votes = rating8.sort_values('numVotes',ascending=False).head(20)
print(sorted_votes)

           tconst  averageRating  numVotes
81976   tt0111161            9.3   2388396
247653  tt0468569            9.0   2351083
601435  tt1375666            8.8   2109738
98362   tt0137523            8.8   1889079
81759   tt0110912            8.9   1859734
80886   tt0109830            8.8   1847836
426770  tt0944947            9.3   1806734
96237   tt0133093            8.7   1707615
89756   tt0120737            8.8   1690747
113309  tt0167260            8.9   1669954
45803   tt0068646            9.2   1655077
390485  tt0816692            8.6   1554375
591030  tt1345836            8.4   1542645
113310  tt0167261            8.7   1510296
416866  tt0903747            9.4   1506486
84528   tt0114369            8.6   1474459
696404  tt1853728            8.4   1386528
115750  tt0172495            8.5   1364063
207590  tt0372784            8.2   1330476
75174   tt0102926            8.6   1296250
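
The filter-then-sort step can also be written with `DataFrame.nlargest`, which returns the top rows by a column without sorting the whole frame by hand. The sketch below uses a handful of rows copied from the outputs above; the variable names are illustrative only.

```python
import pandas as pd

# A few rows taken from the ratings output shown earlier.
ratings_sample = pd.DataFrame({
    "tconst": ["tt0000361", "tt0000417", "tt0111161", "tt0468569", "tt0000001"],
    "averageRating": [8.1, 8.2, 9.3, 9.0, 5.7],
    "numVotes": [10, 45079, 2388396, 2351083, 1702],
})

# Keep titles rated above 8.0, then take the two with the most votes.
high_rated = ratings_sample[ratings_sample["averageRating"] > 8.0]
top2 = high_rated.nlargest(2, "numVotes")
print(top2)
```

On the full table, `nlargest(20, "numVotes")` should select the same rows as `sort_values("numVotes", ascending=False).head(20)`, apart from how ties are ordered.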


In [13]:
# Merge the two tables on their common column, "tconst".

merged = pd.merge(sorted_votes,movie, on='tconst')
print(merged)

       tconst  averageRating  ...  runtimeMinutes                   genres
0   tt0111161            9.3  ...             142                    Drama
1   tt0468569            9.0  ...             152       Action,Crime,Drama
2   tt1375666            8.8  ...             148  Action,Adventure,Sci-Fi
3   tt0137523            8.8  ...             139                    Drama
4   tt0110912            8.9  ...             154              Crime,Drama
5   tt0109830            8.8  ...             142            Drama,Romance
6   tt0944947            9.3  ...              57   Action,Adventure,Drama
7   tt0133093            8.7  ...             136            Action,Sci-Fi
8   tt0120737            8.8  ...             178   Action,Adventure,Drama
9   tt0167260            8.9  ...             201   Action,Adventure,Drama
10  tt0068646            9.2  ...             175              Crime,Drama
11  tt0816692            8.6  ...             169   Adventure,Drama,Sci-Fi
12  tt1345836            

In [14]:
# The title table contains TV series as well as movies, so keep only the movies.
movie_type = merged.loc[merged['titleType'] == 'movie']

In [16]:
title = movie_type[['originalTitle','averageRating']].head(20)
print(title)

                                        originalTitle  averageRating
0                            The Shawshank Redemption            9.3
1                                     The Dark Knight            9.0
2                                           Inception            8.8
3                                          Fight Club            8.8
4                                        Pulp Fiction            8.9
5                                        Forrest Gump            8.8
7                                          The Matrix            8.7
8   The Lord of the Rings: The Fellowship of the Ring            8.8
9       The Lord of the Rings: The Return of the King            8.9
10                                      The Godfather            9.2
11                                       Interstellar            8.6
12                              The Dark Knight Rises            8.4
13              The Lord of the Rings: The Two Towers            8.7
15                                

In [17]:
title.sort_values('averageRating',ascending=False)

                                        originalTitle  averageRating
0                            The Shawshank Redemption            9.3
10                                      The Godfather            9.2
1                                     The Dark Knight            9.0
4                                        Pulp Fiction            8.9
9       The Lord of the Rings: The Return of the King            8.9
2                                           Inception            8.8
3                                          Fight Club            8.8
5                                        Forrest Gump            8.8
8   The Lord of the Rings: The Fellowship of the Ring            8.8
7                                          The Matrix            8.7


In [20]:
title.to_csv('/content/drive/MyDrive/DADV/Output_files/popular_movies.csv')

# 2. find the 20 best rated movies with over 40,000 votes in the 2000s (year >= 2000)

In [21]:
# First, check the column data types in the movie table.
movie.dtypes

tconst            object
titleType         object
primaryTitle      object
originalTitle     object
isAdult           object
startYear         object
endYear           object
runtimeMinutes    object
genres            object
dtype: object

startYear is stored as object (string), so it has to be converted to a numeric type before it can be compared with 2000.

In [22]:
movie['startYear'] = pd.to_numeric(movie['startYear'], errors='coerce')

In [23]:
movie.dtypes

tconst             object
titleType          object
primaryTitle       object
originalTitle      object
isAdult            object
startYear         float64
endYear            object
runtimeMinutes     object
genres             object
dtype: object
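
To make the effect of `errors='coerce'` concrete, here is a small sketch on a hand-made series. The values are illustrative; the `\N` entry mimics IMDb's missing-value marker. Anything that cannot be parsed as a number becomes `NaN`, which is also why the column ends up as `float64` and not an integer type.

```python
import pandas as pd

# Years as strings, with one missing value written as \N.
years = pd.Series(["1894", "2008", "\\N", "2010"])

as_float = pd.to_numeric(years, errors="coerce")  # \N becomes NaN
print(as_float.dtype)

# Comparisons against NaN evaluate to False, so missing years are dropped.
from_2000 = as_float >= 2000
print(from_2000.tolist())

# Optional: pandas' nullable integer type keeps whole-number years.
as_int = as_float.astype("Int64")
print(as_int.dtype)
```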

In [25]:
year = movie[movie.startYear >= 2000]
print(year)

            tconst  titleType  ... runtimeMinutes                   genres
11060    tt0011216      movie  ...             67                    Drama
11637    tt0011801      movie  ...             \N             Action,Crime
15181    tt0015414      movie  ...             60                       \N
16659    tt0016906      movie  ...             80           Comedy,Musical
18034    tt0018295      short  ...             40       Action,Drama,Short
...            ...        ...  ...            ...                      ...
7899940  tt9916848  tvEpisode  ...             \N      Action,Drama,Family
7899941  tt9916850  tvEpisode  ...             \N      Action,Drama,Family
7899942  tt9916852  tvEpisode  ...             \N      Action,Drama,Family
7899943  tt9916856      short  ...             27                    Short
7899944  tt9916880  tvEpisode  ...             10  Animation,Comedy,Family

[5226568 rows x 9 columns]


The movie table has no ratings, so we merge it with the ratings table.

In [26]:
year_rating = pd.merge(year,ratings, on='tconst')
print(year_rating)

           tconst  titleType  ... averageRating numVotes
0       tt0011216      movie  ...           6.3       23
1       tt0015414      movie  ...           5.4       11
2       tt0016906      movie  ...           5.6       15
3       tt0018295      short  ...           6.6       31
4       tt0019996      movie  ...           6.3       52
...           ...        ...  ...           ...      ...
777769  tt9916682  tvEpisode  ...           5.6        5
777770  tt9916690  tvEpisode  ...           6.6        5
777771  tt9916720      short  ...           6.4       80
777772  tt9916766  tvEpisode  ...           6.9       16
777773  tt9916778  tvEpisode  ...           7.5       27

[777774 rows x 11 columns]
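
The row count falls from about 5.2 million to about 778 thousand because `pd.merge` performs an inner join by default: titles without an entry in the ratings table are dropped. The sketch below reproduces this with three `tconst` values from the output above, of which only two appear in the merged result. The `validate` argument is an optional safeguard added for this example and was not part of the original cell.

```python
import pandas as pd

titles_part = pd.DataFrame({
    "tconst": ["tt0011216", "tt0011801", "tt0015414"],
    "titleType": ["movie", "movie", "movie"],
})
ratings_part = pd.DataFrame({
    "tconst": ["tt0011216", "tt0015414"],
    "averageRating": [6.3, 5.4],
    "numVotes": [23, 11],
})

joined = pd.merge(
    titles_part,
    ratings_part,
    on="tconst",
    how="inner",            # the default, written out for clarity
    validate="one_to_one",  # raises MergeError if tconst repeats on either side
)
print(joined)
```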


We now have the titles from 2000 onward; next we keep those with more than 40,000 votes.

In [27]:
votes = year_rating[year_rating.numVotes > 40000]
print(votes)

           tconst     titleType  ... averageRating numVotes
5       tt0035423         movie  ...           6.4    79769
59      tt0118694         movie  ...           8.1   129564
92      tt0120630         movie  ...           7.0   179248
93      tt0120667         movie  ...           5.7   314330
95      tt0120679         movie  ...           7.4    82547
...           ...           ...  ...           ...      ...
774183  tt9784798         movie  ...           7.5    45678
774965  tt9815454  tvMiniSeries  ...           8.0    61837
776325  tt9866072         movie  ...           6.1    48685
777027  tt9893250         movie  ...           6.3    94836
777424  tt9906260     tvEpisode  ...           9.9    60042

[3291 rows x 11 columns]


The task asks for the best-rated titles, so we sort by averageRating in descending order.

In [29]:
sorted_ratings = votes.sort_values('averageRating',ascending = False)
print(sorted_ratings)

            tconst  titleType  ... averageRating numVotes
410700   tt2301451  tvEpisode  ...          10.0   136543
777424   tt9906260  tvEpisode  ...           9.9    60042
279189  tt13857684  tvEpisode  ...           9.9    52771
410702   tt2301455  tvEpisode  ...           9.9    93121
396712   tt2178784  tvEpisode  ...           9.9    93324
...            ...        ...  ...           ...      ...
24863    tt0362165      movie  ...           2.2    53150
224074   tt1213644      movie  ...           1.9    87811
335599   tt1702443      movie  ...           1.6    75114
624039   tt5988370      movie  ...           1.4    72610
710002   tt7886848      movie  ...           1.1    65826

[3291 rows x 11 columns]


Now we keep only the rows whose titleType is movie, then select just the columns we need.

In [31]:
movies = sorted_ratings.loc[sorted_ratings['titleType'] == 'movie'].head(20)
print(movies)

            tconst titleType  ... averageRating numVotes
436103   tt2592910     movie  ...           9.2    44217
144827  tt10189514     movie  ...           9.1    60309
54917    tt0468569     movie  ...           9.0  2351083
472      tt0167260     movie  ...           8.9  1669954
99       tt0120737     movie  ...           8.8  1690747
276349   tt1375666     movie  ...           8.8  2109738
614385   tt5813916     movie  ...           8.8   105910
473      tt0167261     movie  ...           8.7  1510296
3721     tt0245429     movie  ...           8.6   669152
661934   tt6751668     movie  ...           8.6   603669
16565    tt0317248     movie  ...           8.6   708535
105571   tt0816692     movie  ...           8.6  1554375
59387    tt0482571     movie  ...           8.5  1213484
4613     tt0253474     movie  ...           8.5   747243
532      tt0172495     movie  ...           8.5  1364063
331761   tt1675434     movie  ...           8.5   777070
289809   tt1424432     movie  .
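
The year, vote and title-type filters used for this task can also be combined into one expression with `DataFrame.query`. The sketch below is an alternative formulation, not the method used in the cells above. The first three rows reuse values from the notebook output, while the last two rows are hypothetical and exist only to show rows being filtered out.

```python
import pandas as pd

sample = pd.DataFrame({
    "originalTitle": ["The Dark Knight", "Inception", "Gladiator",
                      "Hypothetical Episode", "Hypothetical Obscure Film"],
    "titleType": ["movie", "movie", "movie", "tvEpisode", "movie"],
    "startYear": [2008.0, 2010.0, 2000.0, 2013.0, 2005.0],
    "averageRating": [9.0, 8.8, 8.5, 10.0, 9.5],
    "numVotes": [2351083, 2109738, 1364063, 136543, 23],
})

best = (
    sample
    .query("titleType == 'movie' and startYear >= 2000 and numVotes > 40000")
    # Sort by rating, using the vote count to break ties.
    .sort_values(["averageRating", "numVotes"], ascending=False)
    .head(20)
)
print(best[["originalTitle", "averageRating", "numVotes"]])
```

Sorting on a second column makes the order of equally rated movies deterministic, which matters when the cut-off at 20 falls inside a group of ties.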

In [32]:
best_movies = movies[['originalTitle','numVotes','startYear','averageRating']].head(20)
print(best_movies)

                                            originalTitle  ...  averageRating
436103                             CM101MMXI Fundamentals  ...            9.2
144827                                    Soorarai Pottru  ...            9.1
54917                                     The Dark Knight  ...            9.0
472         The Lord of the Rings: The Return of the King  ...            8.9
99      The Lord of the Rings: The Fellowship of the Ring  ...            8.8
276349                                          Inception  ...            8.8
614385                                             Dag II  ...            8.8
473                 The Lord of the Rings: The Two Towers  ...            8.7
3721                        Sen to Chihiro no kamikakushi  ...            8.6
661934                                       Gisaengchung  ...            8.6
16565                                      Cidade de Deus  ...            8.6
105571                                       Interstellar  ...  

In [33]:
best_movies.sort_values('numVotes',ascending=False)

                                            originalTitle  numVotes  startYear  averageRating
54917                                     The Dark Knight   2351083     2008.0            9.0
276349                                          Inception   2109738     2010.0            8.8
99      The Lord of the Rings: The Fellowship of the Ring   1690747     2001.0            8.8
472         The Lord of the Rings: The Return of the King   1669954     2003.0            8.9
105571                                       Interstellar   1554375     2014.0            8.6
473                 The Lord of the Rings: The Two Towers   1510296     2002.0            8.7
532                                             Gladiator   1364063     2000.0            8.5
59387                                        The Prestige   1213484     2006.0            8.5
37159                                        The Departed   1211118     2006.0            8.5
331761                                       Intouchables    777070     2011.0            8.5


In [34]:
best_movies.to_csv('/content/drive/MyDrive/DADV/Output_files/movies_with_high_votes.csv')