# Data Mining
## Practical Assignment 3

### Important Notes:
1. Submit through **Blackboard** in electronic form before 11:59pm on Wednesday, May 31, 2017
2. No late homework will be accepted.
3. This is a group-of-three assignment; hence choose two partners to work with.
4. The submitted file should be in ipynb format
5. The file should be submitted by all students in the group.
6. The assignment is worth 10 points.
7. For questions, please use [Piazza](http://piazza.com/university_of_amsterdam/spring2017/5072dami6y) (English only!)
8. The indication **optional** means that the question is optional; you won't lose any points if you skip that part of the assignment, nor will you gain any if you do it.

### Software:
We will be using the Python programming language throughout this course. In addition, we will be using:
+ IPython Notebooks (as an environment)
+ Numpy
+ Pandas
+ Scikit-learn


### Background:

This practical assignment covers clustering and working with text.

For the assignment, please download the dataset on [Movies](https://drive.google.com/drive/folders/0B-zklbckv9CHMmhzSXRPMk9tSWs?usp=sharing).

The folder contains two data files and a README file. Both data files, plot_summaries.txt and movie.metadata.tsv, are tab-separated.

The former, plot_summaries.txt, contains the plot summaries of 42,306 movies extracted from the November 2, 2012 dump of English-language Wikipedia. Each line contains the Wikipedia movie ID (which indexes into movie.metadata.tsv) followed by the summary.

The latter, movie.metadata.tsv, contains metadata for 81,741 movies, extracted from the November 4, 2012 dump of Freebase. Freebase is a knowledge base (similar to a database) that contains information about different entities (including movies). The file has the following columns:

1. Wikipedia movie ID
2. Freebase movie ID
3. Movie name
4. Movie release date
5. Movie box office revenue
6. Movie runtime
7. Movie languages (Freebase ID:name tuples)
8. Movie countries (Freebase ID:name tuples)
9. Movie genres (Freebase ID:name tuples)

The goal of this assignment will be to cluster movies.

**Important Note**: This third assignment is not as instructive as the first assignment. The first assignment guided you step-by-step through all the preprocessing, training-validation-testing setup, etc. This assignment does not do so, but it leaves it up to you to decide how to use the data and design your experiments.

### Part 1: Import the data

We import both files and perform a join (merging the two files) on the Wikipedia ID (WID) to match the movies that appear in the summaries to those that appear in the metadata. A movie that is missing from either file is not included in the final list.

In [1]:
import pandas as pd
metadata = pd.read_csv('MovieSummaries/movie.metadata.tsv',sep="\t", header = None,
                        names=['WID', 'FID', 'Name', 'Release', 'Revenue', 
                               'Runtime', 'Languages', 'Countries', 'Genres'])
summaries = pd.read_csv('MovieSummaries/plot_summaries.txt',sep="\t", header = None,
                         names=['WID', 'Text'])
films = pd.merge(metadata, summaries, on='WID')
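To check what the join keeps and drops, the following minimal sketch runs the same kind of merge on two toy frames. The IDs, names and texts are made up for illustration; only `pd.merge` and its `how` and `indicator` parameters come from pandas itself.

```python
import pandas as pd

# Toy stand-ins for the two files; all values are made up.
metadata_toy = pd.DataFrame({'WID': [1, 2, 3], 'Name': ['A', 'B', 'C']})
summaries_toy = pd.DataFrame({'WID': [2, 3, 4],
                              'Text': ['plot of B', 'plot of C', 'plot of D']})

# pd.merge defaults to how='inner': only WIDs present in both frames survive.
joined = pd.merge(metadata_toy, summaries_toy, on='WID')
print(joined)

# An outer join with indicator=True shows which side each WID came from.
audit = pd.merge(metadata_toy, summaries_toy, on='WID', how='outer', indicator=True)
print(audit['_merge'].value_counts())
```

Running the outer join with `indicator=True` on the real files would be one way to count how many movies have metadata but no summary, and vice versa.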

A movie can have more than one genre. We extract the first genre listed for each movie. Some movies may not have any corresponding genre; for those we store an empty string.

In [2]:
import ast

genres = []
for film in films.values:
    exist = False
    g = ast.literal_eval(film[8])
    # Get the first genre for this movie
    for key in g:
        exist = True
        genres.append(g[key])
        break
    # If there is no genre for this movie
    if exist is False:
        genres.append('')
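The same extraction can be sketched as a small helper, which makes it easy to try on a single cell. The function name `first_genre` and the cell values below are hypothetical; the IDs are placeholders, not real Freebase IDs.

```python
import ast

def first_genre(cell):
    """Return the first genre name in a Genres cell, or '' if there is none."""
    mapping = ast.literal_eval(cell)      # the cell is a dict literal: {Freebase ID: name}
    return next(iter(mapping.values()), '')

# Illustrative cell values only.
print(first_genre('{"/m/id1": "Drama", "/m/id2": "Comedy"}'))
print(repr(first_genre('{}')))
```

Assuming the Genres column always holds such dict literals, `films['Genres'].apply(first_genre)` should give the same values as the loop above.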

Consider only movies in four genres: 'Drama', 'Comedy', 'Science Fiction', 'Action'. Then sort them by their box office revenue and keep the 100 highest-grossing ones.

In [3]:
# Merge the films with the genre into a single Dataframe
genres = pd.Series(genres, name='Genre')
films_genre = pd.concat([films, genres], axis=1)

# Get only movies about the four following genres
films_genre_ind = films_genre.set_index('Genre')
movie_genres = ['Drama', 'Comedy', 'Science Fiction', 'Action']
# Stack the rows of each selected genre into one DataFrame
genre100 = pd.concat([films_genre_ind.loc[[mg]] for mg in movie_genres])

# Get the top-100 of those
top100 = (genre100.sort_values(by='Revenue',ascending=False)[0:100]).reset_index()[['Name','Text','Genre']]

In [4]:
# Look at the distribution of your movies in the dataset
print(top100['Genre'].value_counts())

Drama              45
Comedy             32
Science Fiction    12
Action             11
Name: Genre, dtype: int64


### Part 2: Turn movies into BoW and Topics representation (Lecture 7) (5pts)

Turn each movie plot summary (i.e. the 'Text' column in the top100 dataframe) into:

In [79]:
# Plot summaries as a NumPy array of strings
documents = top100['Text'].values

* **Bag-of-Words**

In [80]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

In [144]:
def BagOfWords(documents):
    vect = CountVectorizer(analyzer="word",
                          stop_words=None)
    return vect.fit_transform(documents)

bow = BagOfWords(documents)

* **Bag-of-bigrams**

In [145]:
def BagOfBigrams(documents):
    vect = CountVectorizer(ngram_range=(2,2),
                           analyzer="word",
                          stop_words=None)
    return vect.fit_transform(documents)

bo2g = BagOfBigrams(documents)

* **Bag-of-ngrams (for n = 1 and 2)**

In [147]:
def BagOfNgrams(documents, n=2):
    # ngram_range=(1, n) counts every n-gram from unigrams up to n-grams
    vect = CountVectorizer(ngram_range=(1, n),
                           analyzer="word",
                           stop_words=None)
    return vect.fit_transform(documents)

bo12g = BagOfNgrams(documents, n=2)
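To see how `ngram_range` controls which n-grams are counted, here is a small sketch on a single made-up sentence. The helper `vocabulary` is hypothetical and only lists the features that `CountVectorizer` learns.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["the hero saves the city"]   # made-up sentence

def vocabulary(ngram_range):
    """Sorted list of the features learned for the given ngram_range."""
    vect = CountVectorizer(ngram_range=ngram_range, analyzer="word")
    vect.fit(toy_docs)
    return sorted(vect.vocabulary_)

print(vocabulary((1, 1)))   # unigrams only
print(vocabulary((2, 2)))   # bigrams only
print(vocabulary((1, 2)))   # unigrams and bigrams together
```

A range of `(2, 2)` yields bigrams only, whereas `(1, 2)` yields the union of unigrams and bigrams, which is what a representation "for n = 1 and 2" calls for.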

* **TF-IDF values**

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [156]:
def TfIdf(documents):
    # norm=None keeps the raw TF-IDF weights (no per-document normalisation)
    vect = TfidfVectorizer(norm=None)
    return vect.fit_transform(documents)

tfidf = TfIdf(documents)
# print(tfidf.toarray())
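As a sanity check on what these values mean, the sketch below recomputes them by hand on three made-up documents. It assumes scikit-learn's default smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and `norm=None` as used above, so each weight is simply the term count times the IDF.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_docs = ["robot fights robot", "robot falls in love", "love wins"]  # made-up

counts = CountVectorizer().fit_transform(toy_docs).toarray()  # term counts per document
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                                 # document frequency per term

# Smoothed IDF (the smooth_idf=True default)
idf = np.log((1 + n_docs) / (1 + df)) + 1
manual = counts * idf

from_sklearn = TfidfVectorizer(norm=None).fit_transform(toy_docs).toarray()
print(np.allclose(manual, from_sklearn))
```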

* **Topics**
    * Experiment with different numbers of topics and choose the one that satisfies you by inspecting the top-10 most important words in each topic.
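One possible way to approach this, sketched on made-up documents, is Latent Dirichlet Allocation from scikit-learn. The choice of LDA, the value of `n_topics`, and the toy texts are all assumptions for illustration; other topic models are equally valid.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-in for the plot summaries.
toy_docs = [
    "alien ship attacks earth and the fleet fights back",
    "space crew finds alien planet",
    "couple falls in love at a wedding",
    "wedding plans go wrong and the couple laughs",
]

vect = CountVectorizer(stop_words="english")
counts = vect.fit_transform(toy_docs)

n_topics = 2          # hypothetical choice; try several values
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
doc_topics = lda.fit_transform(counts)    # shape: (n_documents, n_topics)

# Column index -> term, recovered from the fitted vocabulary.
terms = np.array(sorted(vect.vocabulary_, key=vect.vocabulary_.get))

top_words = []
for k, weights in enumerate(lda.components_):
    top = terms[np.argsort(weights)[::-1][:10]]   # ten heaviest terms of topic k
    top_words.append(list(top))
    print("Topic %d: %s" % (k, " ".join(top)))
```

Here `doc_topics` plays the role of the topic representation of each document; rerunning with different `n_topics` values and reading the printed words is the inspection step the bullet describes.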

In [None]:
# your code goes here
topics = # your code goes here
# your code goes here

### Part 3: Clustering (Lecture 6) (5pts)

Cluster the movies using the k-means algorithm.

**Important Note**: To allow you to work on Part 3 before Part 2, the [Movies](https://drive.google.com/open?id=0B-zklbckv9CHcUtwcWxTTjlvcHc) folder also contains a comma-separated file with a representation of the movie plot summaries that I built for you.

##### k-means

+ Choose the number of clusters you wish to find in the data (n_clusters)
+ Choose a text representation from Part 2
+ Run a k-means algorithm
+ Evaluate the quality of the clustering using inertia_, silhouette_score, adjusted_mutual_info_score, and adjusted_rand_score
    + adjusted_mutual_info_score and adjusted_rand_score require the ground truth; inertia_ and silhouette_score are computed from the data and the cluster labels alone
    + use as ground truth the genre of each movie, i.e. the perfect clustering would be the one that clusters movies based on their genre
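The steps above can be sketched as follows on a handful of made-up documents. The toy texts, the toy genres, the TF-IDF representation and `n_clusters=2` are assumptions for illustration; in the assignment they would be replaced by a representation from Part 2 and `top100['Genre']`.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import (adjusted_mutual_info_score, adjusted_rand_score,
                             silhouette_score)

# Made-up documents and genres standing in for the real data.
toy_docs = [
    "alien fleet invades earth from space",
    "space station crew meets alien robot",
    "robot army attacks the space colony",
    "earth sends a ship into deep space",
    "wedding guests laugh at the clumsy groom",
    "clumsy groom loses the wedding ring",
    "friends laugh through a disastrous wedding party",
    "the groom and friends plan a funny party",
]
toy_genres = ["Science Fiction"] * 4 + ["Comedy"] * 4

X = TfidfVectorizer().fit_transform(toy_docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

scores = {
    "inertia": km.inertia_,                                     # within-cluster sum of squares
    "silhouette": silhouette_score(X, km.labels_),              # uses data and labels only
    "ami": adjusted_mutual_info_score(toy_genres, km.labels_),  # compares with the genres
    "ari": adjusted_rand_score(toy_genres, km.labels_),         # compares with the genres
}
for name, value in scores.items():
    print("%-10s %.3f" % (name, value))
```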

In [None]:
# your code goes here

**Number of clusters**

* Change the value of n\_clusters and plot inertia_, silhouette_score, adjusted_mutual_info_score, and adjusted_rand_score as a function of n_clusters
* Explain what you observe in the plots.
* Do the same for each text representation from Part 2.
* Explain the differences across different representations if there are any
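A minimal sketch of such a sweep is given below. The helper `sweep_k` is hypothetical, and the data are random Gaussian blobs used as a placeholder; in the assignment, `X` would be one of the Part 2 matrices and `truth` the movie genres.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (adjusted_mutual_info_score, adjusted_rand_score,
                             silhouette_score)

def sweep_k(X, truth, k_values, seed=0):
    """Fit k-means for every k and collect the four scores per k."""
    rows = []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        rows.append({
            "k": k,
            "inertia": km.inertia_,
            "silhouette": silhouette_score(X, km.labels_),
            "ami": adjusted_mutual_info_score(truth, km.labels_),
            "ari": adjusted_rand_score(truth, km.labels_),
        })
    return rows

# Placeholder data: three Gaussian blobs in five dimensions.
rng = np.random.RandomState(0)
truth = np.repeat([0, 1, 2], 20)
X = rng.normal(size=(60, 5)) + 4 * truth[:, None]

results = sweep_k(X, truth, range(2, 8))
for row in results:
    print(row)
```

If matplotlib is available (it is not in the software list above, so this is an assumption), `pd.DataFrame(results).set_index('k').plot(subplots=True)` would draw one curve per score as a function of n_clusters.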

In [None]:
# your code goes here

*your explanations go here*

**Demonstrate clusters**

* for each representation choose the optimal number of clusters and repeat the k-means algorithm for that number of clusters
* print the top-10 most important words within each cluster
* print the titles of the movies for each cluster
* explain what you observe and whether results make sense
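One way to read "most important words within each cluster" is to rank the terms by their weight in the cluster centroid. The sketch below follows that interpretation on made-up titles and texts; the TF-IDF representation and `n_clusters=2` are assumptions, not the required choice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up titles and summaries standing in for top100['Name'] and top100['Text'].
toy_titles = ["Toy Film 1", "Toy Film 2", "Toy Film 3",
              "Toy Film 4", "Toy Film 5", "Toy Film 6"]
toy_docs = [
    "alien fleet invades earth from space",
    "space station crew meets alien robot",
    "robot army attacks the space colony",
    "wedding guests laugh at the clumsy groom",
    "clumsy groom loses the wedding ring",
    "friends laugh through a disastrous wedding party",
]

vect = TfidfVectorizer(stop_words="english")
X = vect.fit_transform(toy_docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Column index -> term, recovered from the fitted vocabulary.
terms = np.array(sorted(vect.vocabulary_, key=vect.vocabulary_.get))
titles = np.array(toy_titles)

# Rank the terms of each centroid by weight, heaviest first.
order = km.cluster_centers_.argsort(axis=1)[:, ::-1]
summary = {}
for c in range(km.n_clusters):
    summary[c] = {
        "words": list(terms[order[c, :10]]),
        "titles": list(titles[km.labels_ == c]),
    }
    print("Cluster %d" % c)
    print("  top words:", ", ".join(summary[c]["words"]))
    print("  titles:   ", ", ".join(summary[c]["titles"]))
```

For a count-based or topic representation the same centroid ranking applies, as long as the column-to-term mapping comes from the vectorizer that produced the matrix.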

In [None]:
# your code goes here

*your explanations go here*