In [2]:
import pandas as pd

### Read data from the Excel file
Use the pandas ``read_excel`` method to read in data from the Excel file. Excel files quite often have multiple sheets and the ability to read a specific sheet or all of them is very important. To make this easy, the pandas read_excel method takes an argument called sheetname that tells pandas which sheet to read in the data from. For this, you can either use the sheet name or the sheet number. Sheet numbers start with zero. If the sheetname argument is not given, it defaults to zero and pandas will import the first sheet.


In [4]:
movies_0 = pd.read_excel('dataset/movies.xls',sheet_name=0)

In [5]:
movies_1 = pd.read_excel('dataset/movies.xls',sheet_name=1)

In [6]:
movies_2 = pd.read_excel('dataset/movies.xls',sheet_name=2)

In [7]:
movies_0.shape

(1338, 25)

In [8]:
movies_1.shape

(2100, 25)

In [9]:
movies_2.shape

(1604, 25)

Since all the three sheets have similar data but for different records movies, we will create a single DataFrame from all the three DataFrames we created above. We will use the pandas ``concat`` method for this and pass in the names of the three DataFrames we just created and assign the results to a new DataFrame object, movies.

In [10]:
movies = pd.concat([movies_0,movies_1,movies_2])

In [11]:
movies.shape

(5042, 25)

In [17]:
movies_0.shape

(1338, 25)

In [18]:
movies_1.shape

(2100, 25)

In [19]:
movies_2.shape

(1604, 25)

### Exploring the data
We can use ``head`` method to print top few rows of the dataset. We can use the ``shape`` method to find out the number of rows and columns for the DataFrame.


In [12]:
movies.shape

(5042, 25)

In [13]:
movies.head()

Unnamed: 0,Title,Year,Genres,Language,Country,Content Rating,Duration,Aspect Ratio,Budget,Gross Earnings,...,Facebook Likes - Actor 1,Facebook Likes - Actor 2,Facebook Likes - Actor 3,Facebook Likes - cast Total,Facebook likes - Movie,Facenumber in posters,User Votes,Reviews by Users,Reviews by Crtiics,IMDB Score
0,Intolerance: Love's Struggle Throughout the Ages,1916.0,Drama|History|War,,USA,Not Rated,123.0,1.33,385907.0,,...,436.0,22.0,9.0,481,691,1.0,10718,88.0,69.0,8.0
1,Over the Hill to the Poorhouse,1920.0,Crime|Drama,,USA,,110.0,1.33,100000.0,3000000.0,...,2.0,2.0,0.0,4,0,1.0,5,1.0,1.0,4.8
2,The Big Parade,1925.0,Drama|Romance|War,,USA,Not Rated,151.0,1.33,245000.0,,...,81.0,12.0,6.0,108,226,0.0,4849,45.0,48.0,8.3
3,Metropolis,1927.0,Drama|Sci-Fi,German,Germany,Not Rated,145.0,1.33,6000000.0,26435.0,...,136.0,23.0,18.0,203,12000,1.0,111841,413.0,260.0,8.3
4,Pandora's Box,1929.0,Crime|Drama|Romance,German,Germany,Not Rated,110.0,1.33,,9950.0,...,426.0,20.0,3.0,455,926,1.0,7431,84.0,71.0,8.0


### Getting statistical information about the data
Pandas has some very handy methods to look at the statistical data about our data set. For example, we can use the ``describe`` method to get a statistical summary of the data set.

In [31]:
movies.describe()

Unnamed: 0,Year,Duration,Aspect Ratio,Budget,Gross Earnings,Facebook Likes - Director,Facebook Likes - Actor 1,Facebook Likes - Actor 2,Facebook Likes - Actor 3,Facebook Likes - cast Total,Facebook likes - Movie,Facenumber in posters,User Votes,Reviews by Users,Reviews by Crtiics,IMDB Score
count,4935.0,5028.0,4714.0,4551.0,4159.0,4938.0,5035.0,5029.0,5020.0,5042.0,5042.0,5029.0,5042.0,5022.0,4993.0,5042.0
mean,2002.470517,107.201074,2.220403,39752620.0,48468410.0,686.621709,6561.323932,1652.080533,645.009761,9700.959143,7527.45716,1.371446,83684.75,272.770808,140.194272,6.442007
std,12.474599,25.197441,1.385113,206114900.0,68452990.0,2813.602405,15021.977635,4042.774685,1665.041728,18165.101925,19322.070537,2.013683,138494.0,377.982886,121.601675,1.125189
min,1916.0,7.0,1.18,218.0,162.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,1.0,1.0,1.6
25%,1999.0,93.0,1.85,6000000.0,5340988.0,7.0,614.5,281.0,133.0,1411.25,0.0,0.0,8599.25,65.0,50.0,5.8
50%,2005.0,103.0,2.35,20000000.0,25517500.0,49.0,988.0,595.0,371.5,3091.0,166.0,1.0,34371.0,156.0,110.0,6.6
75%,2011.0,118.0,2.35,45000000.0,62309440.0,194.75,11000.0,918.0,636.0,13758.75,3000.0,2.0,96347.0,326.0,195.0,7.2
max,2016.0,511.0,16.0,12215500000.0,760505800.0,23000.0,640000.0,137000.0,23000.0,656730.0,349000.0,43.0,1689764.0,5060.0,813.0,9.5


### Check missing values


In [15]:
movies.columns

Index(['Title', 'Year', 'Genres', 'Language', 'Country', 'Content Rating',
       'Duration', 'Aspect Ratio', 'Budget', 'Gross Earnings', 'Director',
       'Actor 1', 'Actor 2', 'Actor 3', 'Facebook Likes - Director',
       'Facebook Likes - Actor 1', 'Facebook Likes - Actor 2',
       'Facebook Likes - Actor 3', 'Facebook Likes - cast Total',
       'Facebook likes - Movie', 'Facenumber in posters', 'User Votes',
       'Reviews by Users', 'Reviews by Crtiics', 'IMDB Score'],
      dtype='object')

In [16]:
movies.isna().sum()

Title                            0
Year                           107
Genres                           0
Language                        11
Country                          4
Content Rating                 302
Duration                        14
Aspect Ratio                   328
Budget                         491
Gross Earnings                 883
Director                       104
Actor 1                          7
Actor 2                         13
Actor 3                         22
Facebook Likes - Director      104
Facebook Likes - Actor 1         7
Facebook Likes - Actor 2        13
Facebook Likes - Actor 3        22
Facebook Likes - cast Total      0
Facebook likes - Movie           0
Facenumber in posters           13
User Votes                       0
Reviews by Users                20
Reviews by Crtiics              49
IMDB Score                       0
dtype: int64

In [17]:
movies.dropna(inplace=True)

In [18]:
movies.shape

(3771, 25)

In [19]:
movies.isna().sum()

Title                          0
Year                           0
Genres                         0
Language                       0
Country                        0
Content Rating                 0
Duration                       0
Aspect Ratio                   0
Budget                         0
Gross Earnings                 0
Director                       0
Actor 1                        0
Actor 2                        0
Actor 3                        0
Facebook Likes - Director      0
Facebook Likes - Actor 1       0
Facebook Likes - Actor 2       0
Facebook Likes - Actor 3       0
Facebook Likes - cast Total    0
Facebook likes - Movie         0
Facenumber in posters          0
User Votes                     0
Reviews by Users               0
Reviews by Crtiics             0
IMDB Score                     0
dtype: int64

#### Sorting the dataset

In Excel, you’re able to sort a sheet based on the values in one or more columns. In pandas, you can do the same thing with the `sort_values` method. For example, let’s sort our movies DataFrame based on the ``Budget`` column.


In [22]:
movies.head()

Unnamed: 0,Title,Year,Genres,Language,Country,Content Rating,Duration,Aspect Ratio,Budget,Gross Earnings,...,Facebook Likes - Actor 1,Facebook Likes - Actor 2,Facebook Likes - Actor 3,Facebook Likes - cast Total,Facebook likes - Movie,Facenumber in posters,User Votes,Reviews by Users,Reviews by Crtiics,IMDB Score
0,Intolerance: Love's Struggle Throughout the Ages,1916.0,Drama|History|War,,USA,Not Rated,123.0,1.33,385907.0,,...,436.0,22.0,9.0,481,691,1.0,10718,88.0,69.0,8.0
1,Over the Hill to the Poorhouse,1920.0,Crime|Drama,,USA,,110.0,1.33,100000.0,3000000.0,...,2.0,2.0,0.0,4,0,1.0,5,1.0,1.0,4.8
2,The Big Parade,1925.0,Drama|Romance|War,,USA,Not Rated,151.0,1.33,245000.0,,...,81.0,12.0,6.0,108,226,0.0,4849,45.0,48.0,8.3
3,Metropolis,1927.0,Drama|Sci-Fi,German,Germany,Not Rated,145.0,1.33,6000000.0,26435.0,...,136.0,23.0,18.0,203,12000,1.0,111841,413.0,260.0,8.3
4,Pandora's Box,1929.0,Crime|Drama|Romance,German,Germany,Not Rated,110.0,1.33,,9950.0,...,426.0,20.0,3.0,455,926,1.0,7431,84.0,71.0,8.0


In [22]:
movies.sort_values(['IMDB Score','Budget'], ascending=False)

Unnamed: 0,Title,Year,Genres,Language,Country,Content Rating,Duration,Aspect Ratio,Budget,Gross Earnings,...,Facebook Likes - Actor 1,Facebook Likes - Actor 2,Facebook Likes - Actor 3,Facebook Likes - cast Total,Facebook likes - Movie,Facenumber in posters,User Votes,Reviews by Users,Reviews by Crtiics,IMDB Score
742,The Shawshank Redemption,1994.0,Crime|Drama,English,USA,R,142.0,1.85,25000000.0,28341469.0,...,11000.0,745.0,461.0,13495,108000,0.0,1689764,4144.0,199.0,9.3
178,The Godfather,1972.0,Crime|Drama,English,USA,R,175.0,1.85,6000000.0,134821952.0,...,14000.0,10000.0,3000.0,28122,43000,1.0,1155770,2238.0,208.0,9.2
1774,The Dark Knight,2008.0,Action|Crime|Drama|Thriller,English,USA,PG-13,152.0,2.35,185000000.0,533316061.0,...,23000.0,13000.0,11000.0,57802,37000,0.0,1676169,4667.0,645.0,9.0
192,The Godfather: Part II,1974.0,Crime|Drama,English,USA,R,220.0,1.85,13000000.0,57300000.0,...,22000.0,14000.0,3000.0,39960,14000,1.0,790926,650.0,149.0,9.0
707,The Lord of the Rings: The Return of the King,2003.0,Action|Adventure|Drama|Fantasy,English,USA,PG-13,192.0,2.35,94000000.0,377019252.0,...,5000.0,857.0,416.0,6434,16000,2.0,1215718,3189.0,328.0,8.9
676,Schindler's List,1993.0,Biography|Drama|History,English,USA,R,185.0,1.85,22000000.0,96067179.0,...,14000.0,795.0,212.0,15233,41000,0.0,865020,1273.0,174.0,8.9
723,Pulp Fiction,1994.0,Crime|Drama,English,USA,R,178.0,2.35,8000000.0,107930000.0,...,13000.0,902.0,857.0,16557,45000,1.0,1324680,2195.0,215.0,8.9
120,"The Good, the Bad and the Ugly",1966.0,Western,Italian,Italy,Approved,142.0,2.35,1200000.0,6100000.0,...,16000.0,34.0,24.0,16089,20000,3.0,503509,780.0,181.0,8.9
79,Inception,2010.0,Action|Adventure|Sci-Fi|Thriller,English,USA,PG-13,148.0,2.35,160000000.0,292568851.0,...,29000.0,27000.0,23000.0,81115,175000,0.0,1468200,2803.0,642.0,8.8
329,The Lord of the Rings: The Fellowship of the R...,2001.0,Action|Adventure|Drama|Fantasy,English,New Zealand,PG-13,171.0,2.35,93000000.0,313837577.0,...,16000.0,5000.0,857.0,22342,21000,2.0,1238746,5060.0,297.0,8.8


In [29]:
sorted_by_budget.head(10)

Unnamed: 0,Title,Year,Genres,Language,Country,Content Rating,Duration,Aspect Ratio,Budget,Gross Earnings,...,Facebook Likes - Actor 1,Facebook Likes - Actor 2,Facebook Likes - Actor 3,Facebook Likes - cast Total,Facebook likes - Movie,Facenumber in posters,User Votes,Reviews by Users,Reviews by Crtiics,IMDB Score
1367,The Host,2006.0,Comedy|Drama|Horror|Sci-Fi,Korean,South Korea,R,110.0,1.85,12215500000.0,2201412.0,...,629.0,398.0,74.0,1173,7000,0.0,68883,279.0,363.0,7.0
1039,Lady Vengeance,2005.0,Crime|Drama,Korean,South Korea,R,112.0,2.35,4200000000.0,211667.0,...,717.0,126.0,38.0,907,4000,0.0,53508,131.0,202.0,7.7
999,Fateless,2005.0,Drama|Romance|War,Hungarian,Hungary,R,134.0,2.35,2500000000.0,195888.0,...,9.0,2.0,0.0,11,607,0.0,5603,45.0,73.0,7.1
986,Princess Mononoke,1997.0,Adventure|Animation|Fantasy,Japanese,Japan,PG-13,134.0,1.85,2400000000.0,2298191.0,...,893.0,851.0,745.0,2710,11000,0.0,221552,570.0,174.0,8.4
885,Steamboy,2004.0,Action|Adventure|Animation|Family|Sci-Fi|Thriller,Japanese,Japan,PG-13,103.0,1.85,2127520000.0,410388.0,...,488.0,336.0,101.0,991,973,1.0,13727,79.0,105.0,6.9
490,Akira,1988.0,Action|Animation|Sci-Fi,Japanese,Japan,R,124.0,1.85,1100000000.0,439162.0,...,6.0,5.0,4.0,28,0,0.0,106160,430.0,150.0,8.1
1236,Godzilla 2000,1999.0,Action|Adventure|Drama|Sci-Fi|Thriller,Japanese,Japan,PG,99.0,2.35,1000000000.0,10037390.0,...,43.0,3.0,3.0,53,339,0.0,5442,140.0,107.0,6.0
1129,Tango,1998.0,Drama|Musical,Spanish,Spain,PG-13,115.0,2.0,700000000.0,1687311.0,...,341.0,26.0,4.0,371,539,3.0,2412,40.0,35.0,7.2
1272,Kabhi Alvida Naa Kehna,2006.0,Drama,Hindi,India,R,193.0,2.35,700000000.0,3275443.0,...,8000.0,1000.0,860.0,10822,659,2.0,13998,264.0,20.0,6.0
90,Kites,2010.0,Action|Drama|Romance|Thriller,English,India,,90.0,,600000000.0,1602466.0,...,594.0,412.0,303.0,1836,0,0.0,9673,106.0,41.0,6.0


### Data Selection – Based on Conditional Filtering
Pandas also enable retrieving data from dataframe based on conditional filters.

What if we want to pick only movies that are released from 2010 to 2016, have a rating of less than 6.0?

In [27]:
subset_2010 = movies[(movies.Year>=2010) & (movies.Year <= 2016) & (movies["IMDB Score"] <= 6.0)]

In [28]:
subset_2010

Unnamed: 0,Title,Year,Genres,Language,Country,Content Rating,Duration,Aspect Ratio,Budget,Gross Earnings,...,Facebook Likes - Actor 1,Facebook Likes - Actor 2,Facebook Likes - Actor 3,Facebook Likes - cast Total,Facebook likes - Movie,Facenumber in posters,User Votes,Reviews by Users,Reviews by Crtiics,IMDB Score
8,Alpha and Omega,2010.0,Adventure|Animation|Comedy|Family|Romance,English,USA,PG,90.0,1.85,20000000.0,25077977.0,...,681.0,611.0,518.0,2486,0,0.0,10986,84.0,84.0,5.3
25,Ca$h,2010.0,Comedy|Crime|Thriller,English,USA,R,118.0,1.85,7000000.0,46451.0,...,26000.0,854.0,410.0,27756,694,2.0,7663,38.0,27.0,6.0
27,Cats & Dogs: The Revenge of Kitty Galore,2010.0,Action|Comedy|Family|Fantasy,English,USA,PG,82.0,1.85,85000000.0,43575716.0,...,975.0,760.0,615.0,3326,0,0.0,10233,63.0,91.0,4.3
31,Clash of the Titans,2010.0,Action|Adventure|Fantasy,English,USA,PG-13,106.0,2.35,125000000.0,163192114.0,...,14000.0,1000.0,850.0,18003,15000,0.0,229679,637.0,344.0,5.8
32,Clash of the Titans,2010.0,Action|Adventure|Fantasy,English,USA,PG-13,106.0,2.35,125000000.0,163192114.0,...,14000.0,1000.0,850.0,18003,15000,0.0,229687,637.0,344.0,5.8
33,Cop Out,2010.0,Action|Comedy|Crime,English,USA,R,107.0,2.35,37000000.0,44867349.0,...,13000.0,642.0,574.0,14483,0,2.0,75347,176.0,203.0,5.6
43,Don't Be Afraid of the Dark,2010.0,Fantasy|Horror|Thriller,English,USA,R,99.0,1.85,25000000.0,24042490.0,...,3000.0,441.0,155.0,3744,10000,1.0,40776,250.0,298.0,5.6
45,Dylan Dog: Dead of Night,2010.0,Action|Comedy|Crime|Fantasy|Horror|Mystery|Sci...,English,USA,PG-13,107.0,2.35,20000000.0,1183354.0,...,403.0,368.0,311.0,1577,0,1.0,13026,75.0,138.0,5.1
48,Eat Pray Love,2010.0,Drama|Romance,English,USA,PG-13,140.0,1.85,60000000.0,80574010.0,...,11000.0,8000.0,745.0,20440,26000,1.0,63493,302.0,213.0,5.7
60,Furry Vengeance,2010.0,Comedy|Family,English,USA,PG,92.0,1.85,35000000.0,17596256.0,...,3000.0,1000.0,734.0,6327,0,1.0,12399,84.0,101.0,3.8


In [29]:
movies_korean = movies[(movies.Language == 'Korean')]

In [30]:
movies_korean

Unnamed: 0,Title,Year,Genres,Language,Country,Content Rating,Duration,Aspect Ratio,Budget,Gross Earnings,...,Facebook Likes - Actor 1,Facebook Likes - Actor 2,Facebook Likes - Actor 3,Facebook Likes - cast Total,Facebook likes - Movie,Facenumber in posters,User Votes,Reviews by Users,Reviews by Crtiics,IMDB Score
657,Oldboy,2003.0,Drama|Mystery|Thriller,Korean,South Korea,R,120.0,2.35,3000000.0,2181290.0,...,717.0,78.0,38.0,852,43000,0.0,356181,809.0,305.0,8.4
890,Tae Guk Gi: The Brotherhood of War,2004.0,Action|Drama|War,Korean,South Korea,R,148.0,2.35,12800000.0,1110186.0,...,717.0,517.0,489.0,1730,0,2.0,31943,224.0,86.0,8.1
1039,Lady Vengeance,2005.0,Crime|Drama,Korean,South Korea,R,112.0,2.35,4200000000.0,211667.0,...,717.0,126.0,38.0,907,4000,0.0,53508,131.0,202.0,7.7
1367,The Host,2006.0,Comedy|Drama|Horror|Sci-Fi,Korean,South Korea,R,110.0,1.85,12215500000.0,2201412.0,...,629.0,398.0,74.0,1173,7000,0.0,68883,279.0,363.0,7.0
1781,"The Good, the Bad, the Weird",2008.0,Action|Adventure|Comedy|Western,Korean,South Korea,R,135.0,2.35,10000000.0,128486.0,...,398.0,149.0,7.0,569,0,0.0,26156,74.0,152.0,7.3


### Applying formulas on the columns
One of the much-used features of Excel is to apply formulas to create new columns from existing column values. In our Excel file, we have Gross Earnings and Budget columns. We can get Net earnings by subtracting Budget from Gross earnings. We could then apply this formula in the Excel file to all the rows. We can do this in pandas also as shown below.

In [31]:
movies["Net Earnings"] = movies["Gross Earnings"] - movies["Budget"]


In [32]:
movies

Unnamed: 0,Title,Year,Genres,Language,Country,Content Rating,Duration,Aspect Ratio,Budget,Gross Earnings,...,Facebook Likes - Actor 2,Facebook Likes - Actor 3,Facebook Likes - cast Total,Facebook likes - Movie,Facenumber in posters,User Votes,Reviews by Users,Reviews by Crtiics,IMDB Score,Net Earnings
3,Metropolis,1927.0,Drama|Sci-Fi,German,Germany,Not Rated,145.0,1.33,6000000.0,26435.0,...,23.0,18.0,203,12000,1.0,111841,413.0,260.0,8.3,-5973565.0
5,The Broadway Melody,1929.0,Musical|Romance,English,USA,Passed,100.0,1.37,379000.0,2808000.0,...,28.0,4.0,109,167,8.0,4546,71.0,36.0,6.3,2429000.0
8,42nd Street,1933.0,Comedy|Musical|Romance,English,USA,Unrated,89.0,1.37,439000.0,2300000.0,...,105.0,45.0,995,439,2.0,7921,97.0,65.0,7.7,1861000.0
11,Top Hat,1935.0,Comedy|Musical|Romance,English,USA,Approved,81.0,1.37,609000.0,3000000.0,...,172.0,23.0,824,1000,2.0,13269,98.0,66.0,7.8,2391000.0
12,Modern Times,1936.0,Comedy|Drama|Family,English,USA,G,87.0,1.37,1500000.0,163245.0,...,8.0,8.0,352,0,1.0,143086,211.0,120.0,8.6,-1336755.0
14,Snow White and the Seven Dwarfs,1937.0,Animation|Family|Fantasy|Musical,English,USA,Approved,83.0,1.37,2000000.0,184925485.0,...,47.0,31.0,229,0,1.0,133348,204.0,145.0,7.7,182925485.0
18,Gone with the Wind,1939.0,Drama|History|Romance|War,English,USA,G,226.0,1.37,3977000.0,198655278.0,...,384.0,248.0,1862,16000,1.0,215340,706.0,157.0,8.2,194678278.0
20,The Wizard of Oz,1939.0,Adventure|Family|Fantasy|Musical,English,USA,Passed,102.0,1.37,2800000.0,22202612.0,...,421.0,357.0,2509,14000,3.0,291875,533.0,213.0,8.1,19402612.0
23,Pinocchio,1940.0,Animation|Family|Fantasy|Musical,English,USA,Approved,88.0,1.37,2600000.0,84300000.0,...,48.0,40.0,1178,0,0.0,90360,147.0,105.0,7.5,81700000.0
35,Duel in the Sun,1946.0,Drama|Romance|Western,English,USA,Unrated,144.0,1.37,8000000.0,20400000.0,...,436.0,332.0,2037,403,0.0,6304,87.0,32.0,6.9,12400000.0


In [34]:
movies.shape

(5042, 26)

### Apply function on column

In [34]:
def label_movie(score):
    if score>=7:
        return "Good"
    elif score>=3:
        return "Average"
    else:
        return "Bad"

In [35]:
movies['Rating Category'] = movies['IMDB Score'].apply(label_movie)

In [36]:
movies

Unnamed: 0,Title,Year,Genres,Language,Country,Content Rating,Duration,Aspect Ratio,Budget,Gross Earnings,...,Facebook Likes - Actor 3,Facebook Likes - cast Total,Facebook likes - Movie,Facenumber in posters,User Votes,Reviews by Users,Reviews by Crtiics,IMDB Score,Net Earnings,Rating Category
3,Metropolis,1927.0,Drama|Sci-Fi,German,Germany,Not Rated,145.0,1.33,6000000.0,26435.0,...,18.0,203,12000,1.0,111841,413.0,260.0,8.3,-5973565.0,Good
5,The Broadway Melody,1929.0,Musical|Romance,English,USA,Passed,100.0,1.37,379000.0,2808000.0,...,4.0,109,167,8.0,4546,71.0,36.0,6.3,2429000.0,Average
8,42nd Street,1933.0,Comedy|Musical|Romance,English,USA,Unrated,89.0,1.37,439000.0,2300000.0,...,45.0,995,439,2.0,7921,97.0,65.0,7.7,1861000.0,Good
11,Top Hat,1935.0,Comedy|Musical|Romance,English,USA,Approved,81.0,1.37,609000.0,3000000.0,...,23.0,824,1000,2.0,13269,98.0,66.0,7.8,2391000.0,Good
12,Modern Times,1936.0,Comedy|Drama|Family,English,USA,G,87.0,1.37,1500000.0,163245.0,...,8.0,352,0,1.0,143086,211.0,120.0,8.6,-1336755.0,Good
14,Snow White and the Seven Dwarfs,1937.0,Animation|Family|Fantasy|Musical,English,USA,Approved,83.0,1.37,2000000.0,184925485.0,...,31.0,229,0,1.0,133348,204.0,145.0,7.7,182925485.0,Good
18,Gone with the Wind,1939.0,Drama|History|Romance|War,English,USA,G,226.0,1.37,3977000.0,198655278.0,...,248.0,1862,16000,1.0,215340,706.0,157.0,8.2,194678278.0,Good
20,The Wizard of Oz,1939.0,Adventure|Family|Fantasy|Musical,English,USA,Passed,102.0,1.37,2800000.0,22202612.0,...,357.0,2509,14000,3.0,291875,533.0,213.0,8.1,19402612.0,Good
23,Pinocchio,1940.0,Animation|Family|Fantasy|Musical,English,USA,Approved,88.0,1.37,2600000.0,84300000.0,...,40.0,1178,0,0.0,90360,147.0,105.0,7.5,81700000.0,Good
35,Duel in the Sun,1946.0,Drama|Romance|Western,English,USA,Unrated,144.0,1.37,8000000.0,20400000.0,...,332.0,2037,403,0.0,6304,87.0,32.0,6.9,12400000.0,Average


In [39]:
movies

Unnamed: 0,Title,Year,Genres,Language,Country,Content Rating,Duration,Aspect Ratio,Budget,Gross Earnings,...,Facebook Likes - Actor 3,Facebook Likes - cast Total,Facebook likes - Movie,Facenumber in posters,User Votes,Reviews by Users,Reviews by Crtiics,IMDB Score,Net Earnings,Rating
0,Intolerance: Love's Struggle Throughout the Ages,1916.0,Drama|History|War,,USA,Not Rated,123.0,1.33,385907.0,,...,9.0,481,691,1.0,10718,88.0,69.0,8.0,,Good
1,Over the Hill to the Poorhouse,1920.0,Crime|Drama,,USA,,110.0,1.33,100000.0,3000000.0,...,0.0,4,0,1.0,5,1.0,1.0,4.8,2900000.0,Average
2,The Big Parade,1925.0,Drama|Romance|War,,USA,Not Rated,151.0,1.33,245000.0,,...,6.0,108,226,0.0,4849,45.0,48.0,8.3,,Good
3,Metropolis,1927.0,Drama|Sci-Fi,German,Germany,Not Rated,145.0,1.33,6000000.0,26435.0,...,18.0,203,12000,1.0,111841,413.0,260.0,8.3,-5973565.0,Good
4,Pandora's Box,1929.0,Crime|Drama|Romance,German,Germany,Not Rated,110.0,1.33,,9950.0,...,3.0,455,926,1.0,7431,84.0,71.0,8.0,,Good
5,The Broadway Melody,1929.0,Musical|Romance,English,USA,Passed,100.0,1.37,379000.0,2808000.0,...,4.0,109,167,8.0,4546,71.0,36.0,6.3,2429000.0,Average
6,Hell's Angels,1930.0,Drama|War,English,USA,Passed,96.0,1.20,3950000.0,,...,4.0,457,279,1.0,3753,53.0,35.0,7.8,,Good
7,A Farewell to Arms,1932.0,Drama|Romance|War,English,USA,Unrated,79.0,1.37,800000.0,,...,99.0,1284,213,1.0,3519,46.0,42.0,6.6,,Average
8,42nd Street,1933.0,Comedy|Musical|Romance,English,USA,Unrated,89.0,1.37,439000.0,2300000.0,...,45.0,995,439,2.0,7921,97.0,65.0,7.7,1861000.0,Good
9,She Done Him Wrong,1933.0,Comedy|Drama|History|Musical|Romance,English,USA,Approved,66.0,1.37,200000.0,,...,28.0,583,328,1.0,4152,59.0,35.0,6.5,,Average


### Pivot Table in pandas
Advanced Excel users also often use pivot tables. A pivot table summarizes the data of another table by grouping the data on an index and applying operations such as sorting, summing, or averaging. You can use this feature in pandas too.

We need to first identify the column or columns that will serve as the index, and the column(s) on which the summarizing formula will be applied. Let’s start small, by choosing Year as the index column and Net Earnings as the summarization column and creating a separate DataFrame from this data.

In [57]:
movies_subset = movies[['Year','Net Earnings']]

In [58]:
movies_subset

Unnamed: 0,Year,Net Earnings
3,1927.0,-5973565.0
5,1929.0,2429000.0
8,1933.0,1861000.0
11,1935.0,2391000.0
12,1936.0,-1336755.0
14,1937.0,182925485.0
18,1939.0,194678278.0
20,1939.0,19402612.0
23,1940.0,81700000.0
35,1946.0,12400000.0


In [68]:
earnings_by_year = movies_subset.pivot_table(index=['Year'])


The values shown in the table are the result of the summarization that ``aggfunc`` applies to the feature data. aggfunc is an aggregate function that pivot_table applies to your grouped data.

By default, it is ``np.mean()``, but you can use different aggregate functions for different features too! Just provide a dictionary as an input to the aggfunc parameter with the feature name as the key and the corresponding aggregate function as the value.

In [61]:
earnings_by_year.tail()

Unnamed: 0_level_0,Net Earnings
Year,Unnamed: 1_level_1
2012.0,19745470.0
2013.0,13148260.0
2014.0,18531240.0
2015.0,15064900.0
2016.0,14384950.0


#### Use sum as aggregate function

In [70]:
import numpy as np
earnings_by_year_sum = movies_subset.pivot_table(index=['Year'],aggfunc=np.sum)


In [71]:
earnings_by_year_sum

Unnamed: 0_level_0,Net Earnings
Year,Unnamed: 1_level_1
1927.0,-5.973565e+06
1929.0,2.429000e+06
1933.0,1.861000e+06
1935.0,2.391000e+06
1936.0,-1.336755e+06
1937.0,1.829255e+08
1939.0,2.140809e+08
1940.0,8.170000e+07
1946.0,3.395000e+07
1947.0,-2.292073e+06


### Exporting the results to Excel
You can export or write a pandas DataFrame to an Excel file using pandas ```to_excel``` method. 

In [104]:
movies.to_excel('movies_dataframe.xlsx')

### Exporting the results to CSV
You can export or write a pandas DataFrame to an CSV file using pandas ```to_csv``` method. 

In [107]:
movies.to_csv('dataset/movies_dataframe.csv',sep='\t')