# Main topics: analyzing movie data with pandas
1. Importing the data
2. Merging DataFrames with merge
3. Pivot tables with pivot_table
4. Group statistics with groupby

# 1. Importing the MovieLens 1M dataset

In [3]:
import pandas as pd

In [4]:
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

In [5]:
users = pd.read_table('pydata-book-master/ch02/movielens/users.dat', sep='::', header=None, names=unames, engine='python')


In [6]:
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']

In [7]:
ratings = pd.read_table('pydata-book-master/ch02/movielens/ratings.dat', sep='::', header=None, names=rnames, engine='python')


In [8]:
mnames = ['movie_id', 'title', 'genres']

In [9]:
movies = pd.read_table('pydata-book-master/ch02/movielens/movies.dat', sep='::', header=None, names=mnames, engine='python')
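
The `::` delimiter used by these files is longer than one character, so pandas interprets it as a regular expression, which only the Python parsing engine supports. The following is a minimal sketch of that behavior on an in-memory sample rather than the real file. The two rows mirror the first rows of `users.dat`, and the names `sample` and `sample_users` are introduced here only for illustration.

```python
import io

import pandas as pd

# Two rows in the same "::"-delimited layout as users.dat.
sample = io.StringIO("1::F::1::10::48067\n2::M::56::16::70072\n")

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

# A multi-character separator is treated as a regular expression,
# so the Python engine is requested explicitly.
sample_users = pd.read_csv(sample, sep='::', header=None,
                           names=unames, engine='python')
print(sample_users)
```

Requesting `engine='python'` up front should avoid the fallback warning that pandas otherwise emits for regex separators.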


Inspect the datasets

In [10]:
movies[:5]

,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [11]:
ratings.head()

,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [12]:
users.head()

,user_id,gender,age,occupation,zip
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455


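# 2. Merging DataFrames
Use merge to combine the datasets. pandas infers the join keys from the column names that the two frames share. Here, users and ratings share the column user_id, and movies and ratings share the column movie_id, so these columns are used to merge the three tables.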

In [13]:
data = pd.merge(pd.merge(ratings, users), movies)

In [14]:
data.head()
# Inspect the merged dataset

,user_id,movie_id,rating,timestamp,gender,age,occupation,zip,title,genres
0,1,1193,5,978300760,F,1,10,48067,One Flew Over the Cuckoo's Nest (1975),Drama
1,2,1193,5,978298413,M,56,16,70072,One Flew Over the Cuckoo's Nest (1975),Drama
2,12,1193,4,978220179,M,25,12,32793,One Flew Over the Cuckoo's Nest (1975),Drama
3,15,1193,4,978199279,M,25,7,22903,One Flew Over the Cuckoo's Nest (1975),Drama
4,17,1193,5,978158471,M,50,1,95350,One Flew Over the Cuckoo's Nest (1975),Drama


In [15]:
data.iloc[0]
# View the first row, which lists every column name with its value

user_id                                            1
movie_id                                        1193
rating                                             5
timestamp                                  978300760
gender                                             F
age                                                1
occupation                                        10
zip                                            48067
title         One Flew Over the Cuckoo's Nest (1975)
genres                                         Drama
Name: 0, dtype: object
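
The merge step above relies on pandas inferring the join keys from shared column names. As a hedged sketch, the same join can be written with the keys named explicitly through the `on` parameter. The three small tables below (`ratings_s`, `users_s`, `movies_s`) and their contents are hypothetical stand-ins, not MovieLens data.

```python
import pandas as pd

# Tiny hypothetical tables with the same key columns as the real data.
ratings_s = pd.DataFrame({'user_id': [1, 1, 2],
                          'movie_id': [10, 20, 10],
                          'rating': [5, 3, 4]})
users_s = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies_s = pd.DataFrame({'movie_id': [10, 20],
                         'title': ['Movie A', 'Movie B']})

# Naming the keys makes the join columns explicit instead of inferred.
merged = (ratings_s
          .merge(users_s, on='user_id')
          .merge(movies_s, on='movie_id'))
print(merged)
```

Spelling out `on` guards against an accidental join on some other column that happens to share a name in both frames.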

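# 3. Pivot tables
To compute the average rating of each movie by gender, we use the pivot_table function, which works much like a pivot table in Excel.

Here, values names the column to aggregate, index gives the row labels, columns gives the column labels (so the ratings are split by gender), and aggfunc is the aggregation function, which defaults to mean.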

In [16]:
mean_ratings = data.pivot_table(values = 'rating', index = 'title',
                                columns = 'gender', aggfunc = 'mean')

This operation produces a new DataFrame, indexed by title, with one column per gender.
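
To make the roles of `values`, `index`, `columns`, and `aggfunc` concrete, here is a small sketch on made-up ratings. The frame `toy` and its contents are hypothetical. The second statement shows one way to obtain the same numbers with `groupby` followed by `unstack`.

```python
import pandas as pd

# Hypothetical ratings: three for "Movie A", two for "Movie B".
toy = pd.DataFrame({
    'title':  ['Movie A', 'Movie A', 'Movie A', 'Movie B', 'Movie B'],
    'gender': ['F', 'F', 'M', 'F', 'M'],
    'rating': [4, 5, 3, 2, 4],
})

# One row per title, one column per gender, each cell a mean rating.
toy_means = toy.pivot_table(values='rating', index='title',
                            columns='gender', aggfunc='mean')
print(toy_means)

# The same numbers via groupby: average per (title, gender) pair,
# then move the gender level from the rows into the columns.
same = toy.groupby(['title', 'gender'])['rating'].mean().unstack('gender')
print(same)
```

In this toy data, the two female ratings for "Movie A" (4 and 5) average to 4.5, which is the value in the `F` column of that row.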

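# 4. Group statistics
Next, we filter out movies that received fewer than 250 ratings. To do so, we count how many times each title appears, using groupby to compute the per-group counts.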

In [19]:
mean_ratings.head()

title,F,M
"$1,000,000 Duck (1971)",3.375,2.761905
'Night Mother (1986),3.388889,3.352941
'Til There Was You (1997),2.675676,2.733333
"'burbs, The (1989)",2.793478,2.962085
...And Justice for All (1979),3.828571,3.689024


In [21]:
# Group by title, then call size() to get a Series holding the number of ratings per movie
ratings_by_title = data.groupby('title').size()

In [22]:
ratings_by_title.head()

title
$1,000,000 Duck (1971)            37
'Night Mother (1986)              70
'Til There Was You (1997)         52
'burbs, The (1989)               303
...And Justice for All (1979)    199
dtype: int64

In [26]:
# Select the index entries (titles) of movies with at least 250 ratings
active_titles = ratings_by_title.index[ratings_by_title >= 250]

In [27]:
active_titles

Index([''burbs, The (1989)', '10 Things I Hate About You (1999)',
       '101 Dalmatians (1961)', '101 Dalmatians (1996)', '12 Angry Men (1957)',
       '13th Warrior, The (1999)', '2 Days in the Valley (1996)',
       '20,000 Leagues Under the Sea (1954)', '2001: A Space Odyssey (1968)',
       '2010 (1984)',
       ...
       'X-Men (2000)', 'Year of Living Dangerously (1982)',
       'Yellow Submarine (1968)', 'You've Got Mail (1998)',
       'Young Frankenstein (1974)', 'Young Guns (1988)',
       'Young Guns II (1990)', 'Young Sherlock Holmes (1985)',
       'Zero Effect (1998)', 'eXistenZ (1999)'],
      dtype='object', name='title', length=1216)
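
The count-then-filter pattern above can be sketched on a toy frame. The data and the threshold `min_ratings` are hypothetical, with a value of 2 standing in for the 250 used in the text.

```python
import pandas as pd

# Hypothetical ratings: 3 for "Movie A", 1 for "Movie B", 2 for "Movie C".
toy = pd.DataFrame({
    'title': ['Movie A'] * 3 + ['Movie B'] + ['Movie C'] * 2,
    'rating': [4, 5, 3, 2, 4, 5],
})

# Number of ratings per title.
counts = toy.groupby('title').size()

# Boolean mask on the counts, applied to the index of titles.
min_ratings = 2  # assumed threshold for this sketch
popular = counts.index[counts >= min_ratings]
print(list(popular))

# The resulting index can select rows from any title-indexed object.
means = toy.groupby('title')['rating'].mean()
print(means.loc[popular])
```

Only "Movie B", with a single rating, falls below the threshold and is dropped.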

In [28]:
# Use that index to select the matching rows from mean_ratings
mean_ratings = mean_ratings.loc[active_titles]

In [30]:
mean_ratings.head()

title,F,M
"'burbs, The (1989)",2.793478,2.962085
10 Things I Hate About You (1999),3.646552,3.311966
101 Dalmatians (1961),3.791444,3.5
101 Dalmatians (1996),3.24,2.911215
12 Angry Men (1957),4.184397,4.328421


To find the movies that female viewers rated highest, sort by the F column in descending order.

In [31]:
top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)


In [32]:
top_female_ratings.head()

title,F,M
"Close Shave, A (1995)",4.644444,4.473795
"Wrong Trousers, The (1993)",4.588235,4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950),4.57265,4.464589
Wallace & Gromit: The Best of Aardman Animation (1996),4.563107,4.385075
Schindler's List (1993),4.562602,4.491415


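Compute the rating disagreement: the mean male rating minus the mean female rating, so negative values mark movies that women rated higher than men did.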

In [33]:
mean_ratings['diff'] = mean_ratings['M']-mean_ratings['F']

In [34]:
sorted_by_diff = mean_ratings.sort_values(by='diff')


In [35]:
sorted_by_diff.head()

title,F,M,diff
Dirty Dancing (1987),3.790378,2.959596,-0.830782
Jumpin' Jack Flash (1986),3.254717,2.578358,-0.676359
Grease (1978),3.975265,3.367041,-0.608224
Little Women (1994),3.870588,3.321739,-0.548849
Steel Magnolias (1989),3.901734,3.365957,-0.535777
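
Because `diff` is defined as M minus F, sorting it in ascending order puts the movies that women preferred at the top, and reading the same result in reverse order gives the movies that men preferred. The sketch below shows this on a hypothetical three-movie table; `toy_means` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical mean ratings per gender, indexed by title.
toy_means = pd.DataFrame(
    {'F': [4.5, 2.0, 3.5], 'M': [3.0, 4.0, 3.5]},
    index=pd.Index(['Movie A', 'Movie B', 'Movie C'], name='title'),
)

# Negative diff: women rated the movie higher; positive: men did.
toy_means['diff'] = toy_means['M'] - toy_means['F']

# Ascending sort puts the movies preferred by women first.
by_diff = toy_means.sort_values(by='diff')
print(by_diff)

# Reversing the row order puts the movies preferred by men first.
print(by_diff[::-1])
```

Passing `ascending=False` to `sort_values` would be another way to put the movies preferred by men first.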
