# HW2 - Develop a Categorization System

This problem set is meant to help you familiarize yourself with Python and Pandas. 

### Before You Start
For this problem set, you should download INF0202-HW2.ipynb from bCourses. Create a local copy of the notebook and rename it LASTNAME_FIRSTNAME-HW2.ipynb. Then open your renamed file in your browser by running:
```
jupyter notebook <name_of_renamed_file>
```

Make sure the following libraries load correctly (hit Shift-Enter).

In [2]:
#IPython is what you are using now to run the notebook
import IPython
print("IPython version:      %6.6s (need at least 1.0)" % IPython.__version__)

# Pandas makes working with data tables easier
import pandas as pd
print("Pandas version:       %6.6s (need at least 0.11.0)" % pd.__version__)

# Module for plotting (the version string lives on the top-level package)
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
print("Matplotlib version:   %6.6s (need at least 1.2.1)" % matplotlib.__version__)

# A tool we'll use to aid our data exploration
import itertools

IPython version:       4.2.0 (need at least 1.0)
Pandas version:       0.18.1 (need at least 0.11.0)
Matplotlib version:    1.5.1 (need at least 1.2.1)


### Working in a group?
List the names of other students with whom you worked on this problem set:

### Introduction to the assignment

For this assignment and upcoming assignments, you will be using an IMDB Movie Dataset (download from bCourses). 

Use the following commands to load the dataset:

In [3]:
#load dataset
imdb = pd.read_csv("IMDB_movies.csv", low_memory=False, encoding = "ISO-8859-1")

#subset to only first 100 movies
imdb = imdb[:100]

### Understanding the data

Let's take a look at the dataset with some lightweight exploratory data analysis.

In [4]:
imdb.head()

| | Rank | Title | Genre | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Guardians of the Galaxy | Action,Adventure,Sci-Fi | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 |
| 1 | 2 | Prometheus | Adventure,Mystery,Sci-Fi | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
| 2 | 3 | Split | Horror,Thriller | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar... | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0 |
| 3 | 4 | Sing | Animation,Comedy,Family | Christophe Lourdelet | Matthew McConaughey,Reese Witherspoon, Seth Ma... | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0 |
| 4 | 5 | Suicide Squad | Action,Adventure,Fantasy | David Ayer | Will Smith, Jared Leto, Margot Robbie, Viola D... | 2016 | 123 | 6.2 | 393727 | 325.02 | 40.0 |


Are there any nulls we need to watch out for?

In [5]:
imdb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 11 columns):
Rank                  100 non-null int64
Title                 100 non-null object
Genre                 100 non-null object
Director              100 non-null object
Actors                100 non-null object
Year                  100 non-null int64
Runtime (Minutes)     100 non-null int64
Rating                100 non-null float64
Votes                 100 non-null int64
Revenue (Millions)    91 non-null float64
Metascore             94 non-null float64
dtypes: float64(3), int64(4), object(4)
memory usage: 8.7+ KB
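
The summary above shows that `Revenue (Millions)` and `Metascore` have fewer than 100 non-null entries (91 and 94), so those two columns contain missing values. A minimal sketch of counting nulls per column directly, using a small made-up frame named `toy` in place of `imdb` (the frame and its values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Toy stand-in for `imdb`; the values below are made up for illustration.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Revenue (Millions)": [333.13, np.nan, 138.12, np.nan],
    "Metascore": [76.0, 65.0, np.nan, 59.0],
})

# Count missing values in each column.
null_counts = toy.isnull().sum()
print(null_counts)

# Keep only the columns that actually contain nulls.
columns_with_nulls = null_counts[null_counts > 0]
print(columns_with_nulls)
```

Running the same two calls on `imdb` should point to the columns that need care before they feed into a category.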


Since most films in the dataset belong to multiple genres, let's build a set of the individual genres, without repeats, to see how many genres we are working with.

In [6]:
unique_genres = imdb['Genre'].unique()
individual_genres = []

# split each unique genre combination into a list of individual genres
for genre in unique_genres:
    individual_genres.append(genre.split(','))

# flatten the list of lists into a single list of genres
individual_genres = list(itertools.chain.from_iterable(individual_genres))

# remove duplicates
individual_genres = set(individual_genres)

individual_genres

{'Action',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Drama',
 'Family',
 'Fantasy',
 'History',
 'Horror',
 'Music',
 'Mystery',
 'Romance',
 'Sci-Fi',
 'Thriller',
 'War',
 'Western'}
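
Knowing how often each genre occurs, not just which genres exist, may also help when designing categories. One possible sketch uses `collections.Counter` on a toy `genres` Series built from the five rows shown earlier (the variable names here are assumptions, not part of the assignment):

```python
from collections import Counter

import pandas as pd

# Toy stand-in for imdb['Genre'], using the five rows shown earlier.
genres = pd.Series([
    "Action,Adventure,Sci-Fi",
    "Adventure,Mystery,Sci-Fi",
    "Horror,Thriller",
    "Animation,Comedy,Family",
    "Action,Adventure,Fantasy",
])

# Split each comma-separated string and tally every individual genre.
genre_counts = Counter(
    genre
    for row in genres
    for genre in row.split(",")
)

print(len(genre_counts))            # number of distinct genres
print(genre_counts.most_common(3))  # most frequent genres first
```

Passing `imdb['Genre']` instead of the toy Series should give counts for all 100 films.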

### Brainstorm a Categorization System

Categories provide the framework for organizing resources. Classification assigns individual resources to categories. When humans classify, we have rationales for how we assign resources to categories. These criteria are in part how we carve up the categories themselves. The "principles" for defining categories (enumeration, properties, similarity, cultural vs individual vs institutional) are embodied in the classifications that use these principles.

#### Your task is to create 3 new categories (columns) for this dataset. Before beginning, please outline responses to the questions below. 
1. What is the purpose of these categories? How might each of these categories be used in an information retrieval task?
2. What "principles" will you be using to define categories? Briefly explain why you've chosen these principles to define your categories. 
3. Are the categories at a consistent level of abstraction and granularity? Briefly explain your choice of abstraction and granularity for each category.
4. What are the data types of your categories? Ordinal? Categorical? Continuous? Other? Briefly explain your choice of data type for each category. 
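
To make the data type question concrete, here is a rough sketch of how pandas can represent an ordinal category versus an unordered (nominal) one. The bin edges, labels, and variable names are hypothetical and are not a suggested answer:

```python
import pandas as pd

# Rating values from the five rows shown earlier (a continuous variable).
ratings = pd.Series([8.1, 7.0, 7.3, 7.2, 6.2])

# Continuous -> ordinal: bin ratings into ordered, labeled intervals.
# The bin edges and labels below are hypothetical.
rating_band = pd.cut(
    ratings,
    bins=[0, 6.5, 7.5, 10],
    labels=["low", "medium", "high"],
)
print(rating_band.tolist())
print(rating_band.cat.ordered)   # pd.cut produces an ordered categorical

# Nominal: plain categories with no inherent order.
director = pd.Series(["James Gunn", "Ridley Scott"]).astype("category")
print(director.cat.ordered)
```

An ordered categorical supports comparisons such as "at least medium", which an unordered one does not; that difference is one way to think about which type suits each category.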

In [7]:
# question 1
# record your response here

In [8]:
# question 2
# record your response here

In [9]:
# question 3
# record your response here

In [10]:
# question 4
# record your response here

### Develop a Categorization System

Using the data contained in the dataframe `imdb` created above, create three new categories and append them as new columns to the dataframe `imdb`. Because there are only 100 rows, you can either assign categories by hand or use a function to do so.
_Hint: if using a function, it may be useful to use `pandas.DataFrame.apply`._
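
As an illustration of the hint, the sketch below applies a row-wise function to create one new column. The `runtime_band` rule, its thresholds, and the `Runtime Band` column name are all hypothetical; your own categories should follow from the questions above.

```python
import pandas as pd

# Toy stand-in for `imdb`, using values from the five rows shown earlier.
toy = pd.DataFrame({
    "Title": ["Guardians of the Galaxy", "Prometheus", "Split", "Sing", "Suicide Squad"],
    "Runtime (Minutes)": [121, 124, 117, 108, 123],
})

def runtime_band(row):
    """Hypothetical rule: label a film by its runtime in minutes."""
    minutes = row["Runtime (Minutes)"]
    if minutes < 110:
        return "short"
    if minutes < 122:
        return "standard"
    return "long"

# axis=1 passes each row (as a Series) to the function.
toy["Runtime Band"] = toy.apply(runtime_band, axis=1)
print(toy)
```

The same pattern works on `imdb` itself: define a function that takes a row and returns a label, then assign the result of `apply` to a new column.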

In [11]:
# your code here - category 1

In [12]:
# your code here - category 2

In [13]:
# your code here - category 3

### Display Final Categorization System

Calling `imdb.head()` should display the first five rows of the dataset, including the three additional category columns you created.

In [14]:
imdb.head()

| | Rank | Title | Genre | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Guardians of the Galaxy | Action,Adventure,Sci-Fi | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 |
| 1 | 2 | Prometheus | Adventure,Mystery,Sci-Fi | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
| 2 | 3 | Split | Horror,Thriller | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar... | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0 |
| 3 | 4 | Sing | Animation,Comedy,Family | Christophe Lourdelet | Matthew McConaughey,Reese Witherspoon, Seth Ma... | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0 |
| 4 | 5 | Suicide Squad | Action,Adventure,Fantasy | David Ayer | Will Smith, Jared Leto, Margot Robbie, Viola D... | 2016 | 123 | 6.2 | 393727 | 325.02 | 40.0 |
