<a href="https://colab.research.google.com/github/wesleybeckner/technology_fundamentals/blob/main/C1%20Fundamentals/SOLUTIONS/SOLUTION_Tech_Fun_C1_S3_Functions_and_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Technology Fundamentals Course 1, Session 3: Functions and Pandas Introduction

**Instructor**: Wesley Beckner<br>

**Contact**: wesleybeckner@gmail.com


**Teaching Assistants** Varsha Bang, Harsha Vardhan

**Contact** vbang@uw.edu, harshav@uw.edu


<br>

---

<br>

Today, we will discuss **_functions_** in more depth.  We've seen them previously and used them, for example the `.append()` **_function_** for lists, or the even more general `print()` function.  Here, we'll dig into how you can make your own functions to encapsulate code that you will reuse over and over.  

Then we'll jump into the **Pandas** package.  Packages are collections of related functions; these are the things we `import`. Pandas provides a two-dimensional data structure, the data frame, that resembles a spreadsheet in Excel. In fact, we will import our first dataset and view it with Pandas!

<br>

---

## 3.0 Review from Session on Data Structures and Flow Control

In our last session, we discussed **_lists_**, **_dictionaries_**, and **_flow control_**.

**_Lists_** are **_ordered collections_** of data that can be used to hold multiple pieces of information while preserving their order.  We use `[` and `]` to access elements by their indices, which start at `0`.  Operations on **_lists_** such as slicing use an inclusive lower bound and an exclusive upper bound.  So, the following gets elements from the **_list_** `my_list` with index values of `0`, `1`, and `2`, but **not** `3`!

```
my_list[0:3]
```

> What other way is there of writing the same statement using **_slicing_**?  Hint, think about leaving out one of the numbers in the slice!
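
As a quick refresher on the bounds rule, here is a minimal sketch. The list `letters` and the other variable names are made up for illustration and are not part of the session material.

```python
# Hypothetical example list, used only for illustration
letters = ['a', 'b', 'c', 'd', 'e']

first_three = letters[0:3]   # indices 0, 1, 2 (index 3 is excluded)
same_thing = letters[:3]     # omitting the lower bound defaults to 0
the_rest = letters[3:]       # omitting the upper bound runs to the end

print(first_three)  # ['a', 'b', 'c']
print(same_thing)   # ['a', 'b', 'c']
print(the_rest)     # ['d', 'e']
```

Because the upper bound is exclusive, `letters[:3]` and `letters[3:]` split the list with no overlap and no gap.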

**_Dictionaries_** are **_named_** **_collections_** of data that can be used to hold multiple pieces of information as **_values_** that are addressed by **_keys_**, resulting in a **_key_** to **_value_** data structure.  They are accessed with `[` and `]` but initialized with `{` and `}`.  E.g.

```
my_dict = { 'cake' : 'Tasty!', 'toenails' : 'Gross!' }
my_dict['cake']
```

Finally, we talked about **_flow control_** and using the concept of **_conditional execution_** to decide which code statements were executed.  Remember this figure?

<img src="https://docs.oracle.com/cd/B19306_01/appdev.102/b14261/lnpls008.gif">Flow control figure</img>

> What are the **_if_** statements? <br> 
Where do **_for_** loops fit in? <br>

## 3.1 Functions

A `for` loop lets you repeat some code for every item in a list.  Functions are similar in that they run the same lines of code, frequently with new values for some variables (we call these **_parameters_**).  They differ in that a function is not tied to looping over items: you can call it whenever and wherever you need it.

Functions are a critical part of writing easy to read, reusable code.

Create a function like:
```
def function_name (parameters):
    """
    optional docstring
    """
    function expressions
    return [variable]
```

Here is a simple example.  It prints a string that was passed in and returns nothing.

```
def print_string(string):
    """This prints out a string passed as the parameter."""
    print(string)
    return
```

In [None]:
def print_string(string):
  """This prints out a string passed as the parameter"""
  print(string)
  return

To call the function, use:
```
print_string("GIX is awesome!")
```

_Note:_ The function has to be defined before you can call it!

In [None]:
print_string("GIX is awesome!")

GIX is awesome!


### 3.1.1 Reserved words: def, return, and yield

Notice the highlighted words in our function definition: `def` and `return`. These are *reserved words* in Python. Every function definition starts with `def`; `return` is optional, and a function without it (or with a bare `return`) returns `None`. `yield` is another reserved word that is similar to `return` but operates slightly differently. It is beyond the scope of what we are covering in this session, but this tutorial from [realpython](https://realpython.com/introduction-to-python-generators/) has good information on the topic.

In [None]:
# what is return doing in this function?
def my_square(a):
  return a ** 2

`return` hands back whatever value(s) follow the keyword to the code that called our function.

In [None]:
a = 2
my_square(a)

4

I'm going to return two values...

In [None]:
def my_square(a):
  return a ** 2, a

and we see how the output updates accordingly

In [None]:
my_square(a)

(4, 2)

We can capture these values on the output with...

In [None]:
a = 3
square, new_a = my_square(a)

In [None]:
print(square, new_a)

9 3


### 3.1.2 Global vs local variables and function parameters

In a function, new variables that you create are not saved when the function returns; these are **_local_** variables.  Variables defined outside of the function are **_global_** variables: a function can read them, but assigning to the same name inside the function creates a new local variable instead of changing the global one.

Let's define the following function:

In [None]:
def my_little_func(a):
  b = 10
  return a * b

In [None]:
my_little_func(2)

20

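If I run the following (with the `#` removed) in a fresh session, Python raises a `NameError`: `b` only existed inside `my_little_func`, so it is not defined out here.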

In [None]:
# b

Let's play with this a little further...

...now let's define `b` outside the function and call our function with `a=2`

In [None]:
# what happens here?
b = 100
my_little_func(2)

20

In [None]:
b

100

We see that the global `b` is still 100, not the 10 assigned within the function. This is because `b` inside of `my_little_func` is a *local* variable. 

It doesn't matter how I define `b` outside the function, because within the function it is set locally.

... Let's do this A LITTLE MORE

In [None]:
def my_new_func(a):
  print(b)
  return a*b

Now if I call my new function, `b` takes on the global value because it is not defined locally within the function. 

This is typically not happy happy fun fun behavior for us; we want to be explicit about how we define and use our variables (though there are times when this is appropriate).

In [None]:
b = 100
a = 2 # side note, what did I do here????
my_new_func(a)

100


200

#### 3.1.2.1 Function Parameters

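Arguments in Python are passed by object reference (sometimes called "pass by assignment"): the parameter name inside the function refers to the same object the caller passed in.  This means that if you modify a mutable object, such as a list, in place inside the function, the change is visible outside of the function. Immutable objects, such as numbers and strings, cannot be modified in place, so the caller's value is unchanged. (Enrichment: see below)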

See the following example:

```
def change_list(my_list):
   """Append an item to the list passed into this function."""
   my_list.append('four')
   print('list inside the function: ', my_list)
   return

my_list = [1, 2, 3]
print('list before the function: ', my_list)
change_list(my_list)
print('list after the function: ', my_list)
```

In [None]:
def change_list(my_list):
   """Append an item to the list passed into this function."""
   my_list.append('four')
   print('list inside the function: ', my_list)
   return

my_list = [1, 2, 3]
print('list before the function: ', my_list)
change_list(my_list)
print('list after the function: ', my_list)

list before the function:  [1, 2, 3]
list inside the function:  [1, 2, 3, 'four']
list after the function:  [1, 2, 3, 'four']
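
One way to see what "modified outside of the function" does and does not cover is to contrast changing an object in place with reassigning the parameter name. This is only a sketch; `append_item`, `rebind_items`, and `numbers` are hypothetical names chosen for illustration.

```python
def append_item(items):
    """Mutate the caller's list in place."""
    items.append('four')


def rebind_items(items):
    """Point the local name at a new list; the caller's list is untouched."""
    items = ['something', 'else']
    return items


numbers = [1, 2, 3]

append_item(numbers)
print(numbers)   # [1, 2, 3, 'four']

rebind_items(numbers)
print(numbers)   # [1, 2, 3, 'four']
```

In-place changes such as `.append()` are seen by the caller, while plain assignment to the parameter only affects the local name.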


#### 3.1.2.2 Enrichment: Global, local, and immutables

Let's go back to our former example...

immutables:

* integers, float, str, tuples

In [None]:
# b = "a string"
# b = 10
# b = 10.2
# b = (10, 2)
b = [10, 2]

def my_little_func(b):
  if type(b) == str:
    b += "20"
  elif (type(b) == int) or (type(b) == float):
    b += 10
  elif (type(b) == tuple):
    print("AYYY no tuple changes, Dude")
    pass
  elif (type(b) == list):
    b.append('whoaaaa')
  print(b)
  return

print(b)
my_little_func(b)
print(b)

[10, 2]
[10, 2, 'whoaaaa']
[10, 2, 'whoaaaa']


There is a way to change a global variable within a function: the **_global_** keyword.  Generally, the use of **_global_** variables is not encouraged; use parameters instead. We won't cover the `global` keyword in depth, but the cell below shows it in action, and you can [explore further](https://www.programiz.com/python-programming/global-keyword) on your own if you are interested. 

In [None]:
b = 10
a = 2

def my_little_func(a):
  global b
  b += 20
  print(b)
  return 

print(b)
my_little_func(a)
print(b)

10
30
30


#### Exercise 1: My first function

Write a function that takes one parameter and returns any data structure

> If you are going to return multiple objects, what data structure that we talked about can be used?  Give an example below.

In [None]:
# Cell for exercise 1
def test(a):
  return a * 10

In [None]:
def my_first_function(a):
    return type(a)

m='haha'
my_first_function(m)

str

### 3.1.3 Parameter types

**Function calling:**

* positional 
    * `func(10, 20)`
* keyword
    * `func(a=10, b=20)` or `func(b=20, a=10)`

**Function writing:**
* no default
    * `def func(a, b)`
* default
    * `def func(a=10, b=20)`
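
To make the calling styles above concrete, here is a small sketch. The function `describe_pair` is a hypothetical stand-in for `func`; it simply reports which value ended up in which parameter.

```python
def describe_pair(a, b=20):
    """Return a string showing the value bound to each parameter."""
    return f"a={a}, b={b}"


print(describe_pair(10, 20))      # positional
print(describe_pair(a=10, b=20))  # keyword
print(describe_pair(b=20, a=10))  # keyword arguments can come in any order
print(describe_pair(10))          # b falls back to its default
```

All four calls produce the same string, `a=10, b=20`.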



```
def print_name(first, last='Beckner'):
    print(f'Your name is {first} {last}')
    return
```

In [None]:
def print_name(first, last='Beckner'):
    print("Your name is {} {}".format(first, last))
    return

In [None]:
print_name('Wesley')

Your name is Wesley Beckner


Play around with the above function.

In [None]:
print_name('Wesley', last='Solo')

Your name is Wesley Solo


Functions can contain any code that you put anywhere else including:
* `if`...`elif`...`else`
* `for`...`while`
* other function calls

```
def print_name_age(first, last, age):
    print_name(first, last)
    print('Your age is %d' % (age))
    if age > 25 and age < 40:
        print('You are a millennial!')
    return
```


In [None]:
def print_name_age(first, last, age):
    print_name(first, last)
    print('Your age is %d' % (age))
    if age > 25 and age < 40:
        print('You are a millennial!')
    return

```
print_name_age(age=29, last='Beckner', first='Wesley')
```

In [None]:
print_name_age(age=29, last='Beckner', first='Wesley')

Your name is Wesley Beckner
Your age is 29
You are a millennial!


### 3.1.4 Docstrings

Quick example on docstrings from [sklearn knn](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
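
For a version you can run yourself, here is a minimal sketch of a docstring written in the numpydoc layout that scikit-learn uses. The function `rectangle_area` is a made-up example, not something from the course or from scikit-learn.

```python
def rectangle_area(width, height):
    """Return the area of a rectangle.

    Parameters
    ----------
    width : float
        Length of the horizontal side.
    height : float
        Length of the vertical side.

    Returns
    -------
    float
        The product ``width * height``.
    """
    return width * height


# The docstring is stored on the function object itself
print(rectangle_area.__doc__)

# help() formats the same text for interactive use
help(rectangle_area)
```

In a notebook, typing `rectangle_area?` in a cell shows the same text.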

## 3.2 The scientific python stack

In addition to Python's built-in modules like the ``math`` module we explored above, there are also many often-used third-party modules that are core tools for doing data science with Python.
Some of the most important ones are:

#### [``numpy``](http://numpy.org/): Numerical Python

Numpy is short for "Numerical Python", and contains tools for efficient manipulation of arrays of data.
If you have used other computational tools like IDL or MATLAB, Numpy should feel very familiar.

#### [``scipy``](http://scipy.org/): Scientific Python

Scipy is short for "Scientific Python", and contains a wide range of functionality for accomplishing common scientific tasks, such as optimization/minimization, numerical integration, interpolation, and much more.
We will not look closely at Scipy today, but we will use its functionality later in the course.

#### [``pandas``](http://pandas.pydata.org/): Labeled Data Manipulation in Python

Pandas is short for "Panel Data", and contains tools for doing more advanced manipulation of labeled data in Python, in particular with a columnar data structure called a *Data Frame*.
If you've used the [R](http://rstats.org) statistical language (and in particular the so-called "Hadley Stack"), much of the functionality in Pandas should feel very familiar.

#### [``matplotlib``](http://matplotlib.org): Visualization in Python

Matplotlib started out as a MATLAB plotting clone in Python, and has grown from there in the 15 years since its creation. It is the most popular data visualization tool currently in the Python data world (though other recent packages are starting to encroach on its monopoly).

#### [``scikit-learn``](https://scikit-learn.org/stable/): Machine Learning in Python

Scikit-learn is a machine learning library.

It features various classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means, and DBSCAN.

The library is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

## 3.3 Pandas

### 3.3.1 Pandas and Scikit-Learn `datasets`

We begin by loading the pandas package.  Packages are collections of functions that share a common utility.  We've seen `import` before.  Let's use it to import pandas and all the richness that it has.

We'll also use a very useful feature of the scikit-learn toolkit, the `sklearn.datasets` module, which provides loader functions such as `load_wine`. We will do some very rudimentary tasks with this dataset, just to demonstrate the utility of these loaders, then we will switch over to a more relevant dataset for our purposes.

```
import pandas
from sklearn.datasets import load_wine
```

In [None]:
import pandas
from sklearn.datasets import load_wine

We import a function `load_wine` that loads a simple data set we can play with called the Wine recognition dataset from the 1980s.

You can read more about that dataset [here](https://archive.ics.uci.edu/ml/datasets/Wine)

```
dataset = load_wine()
print(dataset.DESCR)
```

In [None]:
dataset = load_wine()
print(dataset.DESCR)

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
                                   Min   Max   Mean     SD
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0

```
df = pandas.DataFrame()
```

In [None]:
df = pandas.DataFrame()

#### 3.3.1.1 import ... as ... pattern

Because we'll use it so much, we often import under a shortened name using the ``import ... as ...`` pattern:

```
import pandas as pd
```

In [None]:
import pandas as pd

### 3.3.2 Creating pandas dataframes

Let's create an empty _data frame_ and put the result into a variable called `df`.  This is a popular choice for a _data frame_ variable name.

```
df = pd.DataFrame()
```

In [None]:
df = pd.DataFrame()

Let's open the Wine dataset as a pandas data frame.  Notice we change the value of the `df` variable to point to a new data frame.

```
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
```

In [None]:
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)

#### 3.3.2.1 From excel and csv

Please follow this [link](https://raw.githubusercontent.com/wesleybeckner/ds_for_engineers/main/data/truffle_margin/margin_data.csv)

This is what we call a csv, or comma-separated values, file. Pandas has a function, `read_csv`, for reading these files directly. Here we use it to load a csv of IMDb movie data:

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/wesleybeckner/technology_explorers/main/assets/imdb_movies.csv')
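
`read_csv` is not limited to URLs: it also accepts a local file path or any file-like object. The sketch below feeds it a small in-memory string so it runs without a network connection. The names `csv_text` and `small_df` are hypothetical, and the two rows are copied from the movie table shown later in this session.

```python
import io

import pandas as pd

# Stand-in CSV content; in practice this would be a file or a URL
csv_text = """title,year,duration
Miss Jerry,1894,45
Cleopatra,1912,100
"""

# Wrap the string in a file-like object and parse it
small_df = pd.read_csv(io.StringIO(csv_text))

print(small_df)
print(small_df.shape)   # (2, 3)
```

The first line of the text becomes the column names, and each following line becomes one row.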



We can do this in a similar way with excel files. 

In [None]:
pd.read_excel('https://raw.githubusercontent.com/wesleybeckner/ds_for_engineers/main/data/truffle_margin/margin_data.xlsx')

Unnamed: 0,Base Cake,Truffle Type,Primary Flavor,Secondary Flavor,Color Group,Width,Height,Net Sales Quantity in KG,EBITDA,Product
0,Tiramisu,Chocolate Outer,Doughnut,Egg Nog,Amethyst,340,50,8244.500,21833.99,Tiramisu-Chocolate Outer-Doughnut-Egg Nog-Amet...
1,Tiramisu,Chocolate Outer,Doughnut,Egg Nog,Amethyst,1340,25,1857.000,21589.48,Tiramisu-Chocolate Outer-Doughnut-Egg Nog-Amet...
2,Tiramisu,Chocolate Outer,Chocolate,Pear,Amethyst,310,140,17365.000,19050.69,Tiramisu-Chocolate Outer-Chocolate-Pear-Amethy...
3,Tiramisu,Chocolate Outer,Doughnut,Egg Nog,Amethyst,449,50,14309.000,18573.01,Tiramisu-Chocolate Outer-Doughnut-Egg Nog-Amet...
4,Tiramisu,Chocolate Outer,Doughnut,Rock and Rye,Amethyst,640,80,25584.500,14790.90,Tiramisu-Chocolate Outer-Doughnut-Rock and Rye...
...,...,...,...,...,...,...,...,...,...,...
2501,Butter,Chocolate Outer,Lemon Bar,Wild Cherry Cream,Amethyst,930,50,150352.000,-97839.16,Butter-Chocolate Outer-Lemon Bar-Wild Cherry C...
2502,Butter,Chocolate Outer,Cream Soda,Peppermint,Amethyst,900,50,120451.400,-98661.97,Butter-Chocolate Outer-Cream Soda-Peppermint-A...
2503,Butter,Jelly Filled,Orange,Cucumber,Burgundy,905,50,143428.580,-122236.96,Butter-Jelly Filled-Orange-Cucumber-Burgundy-9...
2504,Butter,Chocolate Outer,Horchata,Dill Pickle,Amethyst,597,45,271495.572,-128504.49,Butter-Chocolate Outer-Horchata-Dill Pickle-Am...


#### 3.3.2.2 from lists

In [None]:
my_list = [[1, 2, 3], [3, 4, 5], [5, 6, 7], [7, 8, 9]]
pd.DataFrame(my_list)

Unnamed: 0,0,1,2
0,1,2,3
1,3,4,5
2,5,6,7
3,7,8,9


In [None]:
df_nums = pd.DataFrame(data=[[1, 2, 3], [3, 4, 5], [5, 6, 7], [7, 8, 9]], 
             columns=['x', 'y', 'z'])
df_nums.index = ['a', 'b', 'c', 'd']
df_nums

Unnamed: 0,x,y,z
a,1,2,3
b,3,4,5
c,5,6,7
d,7,8,9


#### 3.3.2.3 from dictionaries

In [None]:
my_dictionary = {'A': ['apple', 'airplane'], 'B': ['bannana', 'bubbles']}
from_dict = pd.DataFrame(my_dictionary)
from_dict

Unnamed: 0,A,B
0,apple,bannana
1,airplane,bubbles


In [None]:
from_dict.to_dict()

{'A': {0: 'apple', 1: 'airplane'}, 'B': {0: 'bannana', 1: 'bubbles'}}

#### Exercise 2: Create a DataFrame

Create a dictionary with the following keys: `movies, songs, books`. Under each key, list your top 5 favorites in the corresponding category. Then use `pd.DataFrame` to turn this dictionary into a DataFrame.

In [None]:
# Cell for Ex 2
faves = {'movies': ['The Matrix', 'The Return of the King', 'Howls Moving Castle',
                    'Harold and Maude', 'The Truman Show'],
         'songs': ['Clair de Lune', 'Father Ocean', 'Sun & Moon', 
                   'Infinite Resource', 'Father and Son'],
         'books': ['Buddhas Brain', 'The Tombs of Antuan', 'The Gunslinger',
                   'Wizard and Glass', 'Dune']}
faves = pd.DataFrame(faves)
faves

Unnamed: 0,movies,songs,books
0,The Matrix,Clair de Lune,Buddhas Brain
1,The Return of the King,Father Ocean,The Tombs of Antuan
2,Howls Moving Castle,Sun & Moon,The Gunslinger
3,Harold and Maude,Infinite Resource,Wizard and Glass
4,The Truman Show,Father and Son,Dune


#### 3.3.2.4 on `pandas.Series`

pandas `Series` objects will come up here and there, but they are not important enough for us to spend dedicated time on them. For now, know that they are a lower-level data collection in the pandas framework. You can think of one as an individual column or row of a pandas dataframe. For more practice with these you can refer to [this documentation]()

In [None]:
faves['movies']

0                The Matrix
1    The Return of the King
2       Howls Moving Castle
3          Harold and Maude
4           The Truman Show
Name: movies, dtype: object
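
To check the "column or row" description for yourself, here is a minimal sketch. The frame `pets` and its contents are invented for illustration.

```python
import pandas as pd

# Hypothetical two-column frame
pets = pd.DataFrame({'name': ['Rex', 'Tom'], 'legs': [4, 4]})

name_column = pets['name']   # selecting one column gives a Series
first_row = pets.iloc[0]     # selecting one row also gives a Series

print(type(name_column))
print(type(first_row))
print(first_row['name'])     # Rex
```

A column Series is indexed by the row labels, while a row Series is indexed by the column names.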

### 3.3.3 Viewing pandas dataframes

The ``head()`` and ``tail()`` methods show us the first and last rows of the data.

```
df.head()
df.tail()
```

In [None]:
df.head(5)

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,writer,production_company,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics
0,tt0000009,Miss Jerry,Miss Jerry,1894,1894-10-09,Romance,45,USA,,Alexander Black,Alexander Black,Alexander Black Photoplays,"Blanche Bayliss, William Courtenay, Chauncey D...",The adventures of a female reporter in the 1890s.,5.9,154,,,,,1.0,2.0
1,tt0000574,The Story of the Kelly Gang,The Story of the Kelly Gang,1906,1906-12-26,"Biography, Crime, Drama",70,Australia,,Charles Tait,Charles Tait,J. and N. Tait,"Elizabeth Tait, John Tait, Norman Campbell, Be...",True story of notorious Australian outlaw Ned ...,6.1,589,$ 2250,,,,7.0,7.0
2,tt0001892,Den sorte drøm,Den sorte drøm,1911,1911-08-19,Drama,53,"Germany, Denmark",,Urban Gad,"Urban Gad, Gebhard Schätzler-Perasini",Fotorama,"Asta Nielsen, Valdemar Psilander, Gunnar Helse...",Two men of high rank are both wooing the beaut...,5.8,188,,,,,5.0,2.0
3,tt0002101,Cleopatra,Cleopatra,1912,1912-11-13,"Drama, History",100,USA,English,Charles L. Gaskill,Victorien Sardou,Helen Gardner Picture Players,"Helen Gardner, Pearl Sindelar, Miss Fielding, ...",The fabled queen of Egypt's affair with Roman ...,5.2,446,$ 45000,,,,25.0,3.0
4,tt0002130,L'Inferno,L'Inferno,1911,1911-03-06,"Adventure, Drama, Fantasy",68,Italy,Italian,"Francesco Bertolini, Adolfo Padovan",Dante Alighieri,Milano Film,"Salvatore Papa, Arturo Pirovano, Giuseppe de L...",Loosely adapted from Dante's Divine Comedy and...,7.0,2237,,,,,31.0,14.0


In [None]:
df.tail(7)

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,writer,production_company,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics
85848,tt9905462,Pengalila,Pengalila,2019,2019-03-08,Drama,111,India,Malayalam,T.V. Chandran,T.V. Chandran,Benzy Productions,"Lal, Akshara Kishor, Iniya, Narain, Renji Pani...",An unusual bond between a sixty year old Dalit...,8.8,553,INR 10000000,,,,,
85849,tt9906644,Manoharam,Manoharam,2019,2019-09-27,"Comedy, Drama",122,India,Malayalam,Anvar Sadik,,chakkalakal Films,"Vineeth Sreenivasan, Aparna Das, Basil Joseph,...",Manoharan is a poster artist struggling to fin...,6.8,491,,,,,9.0,1.0
85850,tt9908390,Le lion,Le lion,2020,2020-01-29,Comedy,95,"France, Belgium",French,Ludovic Colbeau-Justin,"Alexandre Coquelle, Matthieu Le Naour",Monkey Pack Films,"Dany Boon, Philippe Katerine, Anne Serra, Samu...",A psychiatric hospital patient pretends to be ...,5.3,398,,,$ 3507171,,,4.0
85851,tt9911196,De Beentjes van Sint-Hildegard,De Beentjes van Sint-Hildegard,2020,2020-02-13,"Comedy, Drama",103,Netherlands,"German, Dutch",Johan Nijenhuis,"Radek Bajgar, Herman Finkers",Johan Nijenhuis & Co,"Herman Finkers, Johanna ter Steege, Leonie ter...",A middle-aged veterinary surgeon believes his ...,7.7,724,,,$ 7299062,,6.0,4.0
85852,tt9911774,Padmavyuhathile Abhimanyu,Padmavyuhathile Abhimanyu,2019,2019-03-08,Drama,130,India,Malayalam,Vineesh Aaradya,"Vineesh Aaradya, Vineesh Aaradya",RMCC Productions,"Anoop Chandran, Indrans, Sona Nair, Simon Brit...",,7.9,265,,,,,,
85853,tt9914286,Sokagin Çocuklari,Sokagin Çocuklari,2019,2019-03-15,"Drama, Family",98,Turkey,Turkish,Ahmet Faik Akinci,"Ahmet Faik Akinci, Kasim Uçkan",Gizem Ajans,"Ahmet Faik Akinci, Belma Mamati, Metin Keçeci,...",,6.4,194,,,$ 2833,,,
85854,tt9914942,La vida sense la Sara Amat,La vida sense la Sara Amat,2019,2020-02-05,Drama,74,Spain,Catalan,Laura Jou,"Coral Cruz, Pep Puig",La Xarxa de Comunicació Local,"Maria Morera Colomer, Biel Rossell Pelfort, Is...","Pep, a 13-year-old boy, is in love with a girl...",6.7,102,,,$ 59794,,,2.0


The ``shape`` attribute shows us the number of rows and columns:

```
df.shape
```

Note it doesn't have the `()` because it isn't a **_function_** - it is an **_attribute_** or variable attached to the `df` object.

In [None]:
df.shape

(85855, 22)
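
Since `shape` is a plain tuple, it can be unpacked like any other tuple. A small sketch, using a hypothetical 2 x 3 frame named `grid`:

```python
import pandas as pd

grid = pd.DataFrame([[1, 2, 3], [4, 5, 6]])   # made-up 2 x 3 frame

n_rows, n_cols = grid.shape   # attribute: no parentheses needed
print(n_rows, n_cols)         # 2 3
print(grid.size)              # 6, the total number of cells
print(len(grid))              # 2, the number of rows
```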

The ``columns`` attribute gives us the column names

```
df.columns
```


In [None]:
df.columns

Index(['imdb_title_id', 'title', 'original_title', 'year', 'date_published',
       'genre', 'duration', 'country', 'language', 'director', 'writer',
       'production_company', 'actors', 'description', 'avg_vote', 'votes',
       'budget', 'usa_gross_income', 'worlwide_gross_income', 'metascore',
       'reviews_from_users', 'reviews_from_critics'],
      dtype='object')

The ``index`` attribute gives us the row labels:

```
df.index
```

In [None]:
df.index

RangeIndex(start=0, stop=85855, step=1)

The ``dtypes`` attribute gives the data type of each column. Remember the data type **_floating point_**?

```
df.dtypes
```

In [None]:
df.dtypes

imdb_title_id             object
title                     object
original_title            object
year                      object
date_published            object
genre                     object
duration                   int64
country                   object
language                  object
director                  object
writer                    object
production_company        object
actors                    object
description               object
avg_vote                 float64
votes                      int64
budget                    object
usa_gross_income          object
worlwide_gross_income     object
metascore                float64
reviews_from_users       float64
reviews_from_critics     float64
dtype: object

The ``describe()`` method, called with `()` because it is a function rather than an attribute, reports summary statistics for each numeric column:

In [None]:
df.describe()

Unnamed: 0,duration,avg_vote,votes,metascore,reviews_from_users,reviews_from_critics
count,85855.0,85855.0,85855.0,13305.0,78258.0,74058.0
mean,100.351418,5.898656,9493.49,55.896881,46.040826,27.479989
std,22.553848,1.234987,53574.36,17.784874,178.511411,58.339158
min,41.0,1.0,99.0,1.0,1.0,1.0
25%,88.0,5.2,205.0,43.0,4.0,3.0
50%,96.0,6.1,484.0,57.0,9.0,8.0
75%,108.0,6.8,1766.5,69.0,27.0,23.0
max,808.0,9.9,2278845.0,100.0,10472.0,999.0


#### Exercise 3: Viewing DataFrames

Using the dataframe you made in exercise 2, return the following attributes: the datatype stored in each column, the column names, the indices, and the shape.

In [None]:
# Cell for Ex 3
print(faves.dtypes)
print(faves.columns)
print(faves.index)
print(faves.shape)

movies    object
songs     object
books     object
dtype: object
Index(['movies', 'songs', 'books'], dtype='object')
RangeIndex(start=0, stop=5, step=1)
(5, 3)


In [None]:
my_dictionary = {'A': ['apple', 'airplane'], 'B': ['bananana', 'bubbles']}
my_dictionary

{'A': ['apple', 'airplane'], 'B': ['bananana', 'bubbles']}

In [None]:
my_dictionary.pop('A')

['apple', 'airplane']

In [None]:
my_dictionary

{'B': ['bananana', 'bubbles']}

### 3.3.4 Manipulating data with ``pandas``

Here we'll cover some key features of manipulating data with pandas

#### 3.3.4.1 Selection

Access columns by name using square-bracket indexing:

```
df['duration']
```

In [None]:
df['duration']

0         45
1         70
2         53
3        100
4         68
        ... 
85850     95
85851    103
85852    130
85853     98
85854     74
Name: duration, Length: 85855, dtype: int64

Mathematical operations on columns happen *element-wise*:

```
df['duration'] / 60
```

In [None]:
df['duration'] / 60

0        0.750000
1        1.166667
2        0.883333
3        1.666667
4        1.133333
           ...   
85850    1.583333
85851    1.716667
85852    2.166667
85853    1.633333
85854    1.233333
Name: duration, Length: 85855, dtype: float64


Columns can be created (or overwritten) with the assignment operator.
Let's create a column with duration in hours.

```
df['duration (hours)'] = df['duration'] / 60
```

In [None]:
df['duration (hours)'] = df['duration'] / 60
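
The same pattern on a frame small enough to check by hand. This is a sketch only: `films` and the `is_long` column are hypothetical, and the two rows are taken from the movie table above.

```python
import pandas as pd

films = pd.DataFrame({'title': ['Miss Jerry', 'Cleopatra'],
                      'duration': [45, 100]})

# The division is applied to every row of the column at once
films['duration (hours)'] = films['duration'] / 60

# Comparisons are element-wise too, giving one True/False per row
films['is_long'] = films['duration'] > 60

print(films)
```

No loop is needed: pandas applies the operation to each element and lines the results up by row label.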

Let's use the `.head()` **_function_** to see our new data!

```
df.head()
```

In [None]:
df.head()

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,writer,production_company,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics,duration (hours)
0,tt0000009,Miss Jerry,Miss Jerry,1894,1894-10-09,Romance,45,USA,,Alexander Black,Alexander Black,Alexander Black Photoplays,"Blanche Bayliss, William Courtenay, Chauncey D...",The adventures of a female reporter in the 1890s.,5.9,154,,,,,1.0,2.0,0.75
1,tt0000574,The Story of the Kelly Gang,The Story of the Kelly Gang,1906,1906-12-26,"Biography, Crime, Drama",70,Australia,,Charles Tait,Charles Tait,J. and N. Tait,"Elizabeth Tait, John Tait, Norman Campbell, Be...",True story of notorious Australian outlaw Ned ...,6.1,589,$ 2250,,,,7.0,7.0,1.166667
2,tt0001892,Den sorte drøm,Den sorte drøm,1911,1911-08-19,Drama,53,"Germany, Denmark",,Urban Gad,"Urban Gad, Gebhard Schätzler-Perasini",Fotorama,"Asta Nielsen, Valdemar Psilander, Gunnar Helse...",Two men of high rank are both wooing the beaut...,5.8,188,,,,,5.0,2.0,0.883333
3,tt0002101,Cleopatra,Cleopatra,1912,1912-11-13,"Drama, History",100,USA,English,Charles L. Gaskill,Victorien Sardou,Helen Gardner Picture Players,"Helen Gardner, Pearl Sindelar, Miss Fielding, ...",The fabled queen of Egypt's affair with Roman ...,5.2,446,$ 45000,,,,25.0,3.0,1.666667
4,tt0002130,L'Inferno,L'Inferno,1911,1911-03-06,"Adventure, Drama, Fantasy",68,Italy,Italian,"Francesco Bertolini, Adolfo Padovan",Dante Alighieri,Milano Film,"Salvatore Papa, Arturo Pirovano, Giuseppe de L...",Loosely adapted from Dante's Divine Comedy and...,7.0,2237,,,,,31.0,14.0,1.133333


##### 3.3.4.1.1 `loc` and `iloc`

Pandas provides a powerful way to select rows and columns together, using either their labels or their integer positions.

- **`.loc`:**<br/>
Purely label-location based indexer for selection by label (but may also be used with a boolean array).<br/>
  **Important: unlike ordinary Python slices, a slice in `.loc` includes the end label**
  

- **`.iloc`:**<br/>
Purely integer-location based indexing for selection by position (but may also be used with a boolean array).
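
The difference in slice endpoints is easiest to see on a tiny frame. The sketch below uses a hypothetical frame `scores` with letter labels so that labels and positions cannot be confused.

```python
import pandas as pd

scores = pd.DataFrame({'x': [1, 3, 5, 7], 'y': [2, 4, 6, 8]},
                      index=['a', 'b', 'c', 'd'])

by_label = scores.loc['a':'c', 'x']   # labels a, b AND c (end included)
by_position = scores.iloc[0:2, 0]     # positions 0 and 1 (end excluded)

print(by_label.tolist())      # [1, 3, 5]
print(by_position.tolist())   # [1, 3]

# A boolean array selects the rows where the condition is True
print(scores.loc[scores['x'] > 3, 'y'].tolist())   # [6, 8]
```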

In [None]:
df.columns

Index(['imdb_title_id', 'title', 'original_title', 'year', 'date_published',
       'genre', 'duration', 'country', 'language', 'director', 'writer',
       'production_company', 'actors', 'description', 'avg_vote', 'votes',
       'budget', 'usa_gross_income', 'worlwide_gross_income', 'metascore',
       'reviews_from_users', 'reviews_from_critics', 'duration (hours)'],
      dtype='object')

In [None]:
df.loc[::2, ::2]

Unnamed: 0,imdb_title_id,original_title,date_published,duration,language,writer,actors,avg_vote,budget,worlwide_gross_income,reviews_from_users,duration (hours)
0,tt0000009,Miss Jerry,1894-10-09,45,,Alexander Black,"Blanche Bayliss, William Courtenay, Chauncey D...",5.9,,,1.0,0.750000
2,tt0001892,Den sorte drøm,1911-08-19,53,,"Urban Gad, Gebhard Schätzler-Perasini","Asta Nielsen, Valdemar Psilander, Gunnar Helse...",5.8,,,5.0,0.883333
4,tt0002130,L'Inferno,1911-03-06,68,Italian,Dante Alighieri,"Salvatore Papa, Arturo Pirovano, Giuseppe de L...",7.0,,,31.0,1.133333
6,tt0002423,Madame DuBarry,1919-11-26,85,German,"Norbert Falk, Hanns Kräly","Pola Negri, Emil Jannings, Harry Liedtke, Edua...",6.8,,,12.0,1.416667
8,tt0002452,Independenta Romaniei,1912-09-01,120,,"Aristide Demetriade, Petre Liciu","Aristide Demetriade, Constanta Demetriade, Con...",6.7,ROL 400000,,4.0,2.000000
...,...,...,...,...,...,...,...,...,...,...,...,...
85846,tt9904802,Enemy Lines,2020-05-04,92,"English, Polish, Russian, German","Michael Wright, Tom George","Ed Westwick, John Hannah, Tom Wisdom, Corey Jo...",5.0,,,29.0,1.533333
85848,tt9905462,Pengalila,2019-03-08,111,Malayalam,T.V. Chandran,"Lal, Akshara Kishor, Iniya, Narain, Renji Pani...",8.8,INR 10000000,,,1.850000
85850,tt9908390,Le lion,2020-01-29,95,French,"Alexandre Coquelle, Matthieu Le Naour","Dany Boon, Philippe Katerine, Anne Serra, Samu...",5.3,,$ 3507171,,1.583333
85852,tt9911774,Padmavyuhathile Abhimanyu,2019-03-08,130,Malayalam,"Vineesh Aaradya, Vineesh Aaradya","Anoop Chandran, Indrans, Sona Nair, Simon Brit...",7.9,,,,2.166667


In [None]:
df.iloc[-5:, -5:]

Unnamed: 0,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics,duration (hours)
85850,$ 3507171,,,4.0,1.583333
85851,$ 7299062,,6.0,4.0,1.716667
85852,,,,,2.166667
85853,$ 2833,,,,1.633333
85854,$ 59794,,,2.0,1.233333


##### 3.3.4.1.2 column vs index access

In [None]:
df['title'][:10]

0                                           Miss Jerry
1                          The Story of the Kelly Gang
2                                       Den sorte drøm
3                                            Cleopatra
4                                            L'Inferno
5    From the Manger to the Cross; or, Jesus of Naz...
6                                       Madame DuBarry
7                                           Quo Vadis?
8                                Independenta Romaniei
9                                          Richard III
Name: title, dtype: object

In [None]:
df[:10]['title']

0                                           Miss Jerry
1                          The Story of the Kelly Gang
2                                       Den sorte drøm
3                                            Cleopatra
4                                            L'Inferno
5    From the Manger to the Cross; or, Jesus of Naz...
6                                       Madame DuBarry
7                                           Quo Vadis?
8                                Independenta Romaniei
9                                          Richard III
Name: title, dtype: object

In [None]:
# df[0]  # raises a KeyError: 0 is not a column label
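The difference between column access and row access with plain `[]` can be sketched on a small example. The `toy` dataframe below is hypothetical stand-in data, not the movie dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the movie dataframe
toy = pd.DataFrame({'title': ['A', 'B', 'C'], 'year': [1999, 2005, 2010]})

col = toy['title']   # a string inside [] selects a column by label
rows = toy[:2]       # a slice inside [] selects rows by position

try:
    toy[0]           # 0 is not a column label, so this raises KeyError
except KeyError as err:
    print('KeyError:', err)

first_row = toy.iloc[0]  # use iloc (or loc) to pull out a single row
```

This is why `df['title'][:10]` and `df[:10]['title']` give the same result, while `df[0]` fails.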

In [None]:
my_list = [[10, 20, 30]]*4
my_list

[[10, 20, 30], [10, 20, 30], [10, 20, 30], [10, 20, 30]]

In [None]:
my_list = [[10, 20, 30]]*4
mydf = pd.DataFrame(my_list, 
                    index=['a','b','c','d'], 
                    columns=['alpha', 'beta', 'gamma'])
mydf

Unnamed: 0,alpha,beta,gamma
a,10,20,30
b,10,20,30
c,10,20,30
d,10,20,30


In [None]:
mydf.loc['a', 'alpha'] = 'mychange'
mydf.loc[['a', 'b'],['alpha']]

Unnamed: 0,alpha
a,mychange
b,10


In [None]:
# chained indexing like this triggers a SettingWithCopyWarning
# (depending on your pandas warning settings)
mydf['alpha']['a'] = 'newchange'

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


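Use `loc` or `iloc` when assigning new values to a pandas dataframe. A single indexing call operates directly on the original dataframe, whereas chained indexing (two sets of brackets) may modify a temporary copy instead.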
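As a minimal sketch of the recommended pattern, the hypothetical `toy` dataframe below is updated once by label and once by position, each with a single indexing call.

```python
import pandas as pd

# Hypothetical toy data, similar in shape to mydf above
toy = pd.DataFrame({'alpha': [10, 10], 'beta': [20, 20]}, index=['a', 'b'])

# One indexing call: pandas knows exactly which cell of `toy` to change
toy.loc['a', 'beta'] = 99

# The positional equivalent with iloc (row position 1, column position 0)
toy.iloc[1, 0] = -1
```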

##### Exercise 4: Selecting

Select the first 10 rows of the country, genre, and year columns using `loc`. Then repeat the same exercise using `iloc`.

In [None]:
# Cell for Ex 4
df.loc[:9, ['country', 'genre', 'year']]  # loc slices include the end label

Unnamed: 0,country,genre,year
0,USA,Romance,1894
1,Australia,"Biography, Crime, Drama",1906
2,"Germany, Denmark",Drama,1911
3,USA,"Drama, History",1912
4,Italy,"Adventure, Drama, Fantasy",1911
5,USA,"Biography, Drama",1912
6,Germany,"Biography, Drama, Romance",1919
7,Italy,"Drama, History",1913
8,Romania,"History, War",1912
9,"France, USA",Drama,1912


In [None]:
df.columns

Index(['imdb_title_id', 'title', 'original_title', 'year', 'date_published',
       'genre', 'duration', 'country', 'language', 'director', 'writer',
       'production_company', 'actors', 'description', 'avg_vote', 'votes',
       'budget', 'usa_gross_income', 'worlwide_gross_income', 'metascore',
       'reviews_from_users', 'reviews_from_critics', 'duration (hours)'],
      dtype='object')

In [None]:
df.iloc[:10, [3,7,5]]

Unnamed: 0,year,country,genre
0,1894,USA,Romance
1,1906,Australia,"Biography, Crime, Drama"
2,1911,"Germany, Denmark",Drama
3,1912,USA,"Drama, History"
4,1911,Italy,"Adventure, Drama, Fantasy"
5,1912,USA,"Biography, Drama"
6,1919,Germany,"Biography, Drama, Romance"
7,1913,Italy,"Drama, History"
8,1912,Romania,"History, War"
9,1912,"France, USA",Drama
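One detail that is easy to miss when switching between `loc` and `iloc`: a `loc` slice works on labels and includes its endpoint, while an `iloc` slice works on positions and excludes it. A small sketch on a hypothetical dataframe with the default integer index:

```python
import pandas as pd

toy = pd.DataFrame({'x': range(20)})  # default integer labels 0..19

by_label = toy.loc[:9, 'x']      # labels 0 through 9, endpoint included
by_position = toy.iloc[:10, 0]   # positions 0 through 9, endpoint excluded
wider = toy.loc[:10, 'x']        # labels 0 through 10: eleven rows

print(len(by_label), len(by_position), len(wider))  # 10 10 11
```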


#### 3.3.4.2 Filtering

Filtering down your selection will be BIGLY useful in your data quests.

##### 3.3.4.2.1 By String

One of the first tools we'll use to filter our dataset is the `.str.contains` method. Let's take an example.

In [None]:
# remember, if we forget our column names we can quickly pull them up
# with:
df.columns

Index(['imdb_title_id', 'title', 'original_title', 'year', 'date_published',
       'genre', 'duration', 'country', 'language', 'director', 'writer',
       'production_company', 'actors', 'description', 'avg_vote', 'votes',
       'budget', 'usa_gross_income', 'worlwide_gross_income', 'metascore',
       'reviews_from_users', 'reviews_from_critics', 'duration (hours)'],
      dtype='object')

In [None]:
df['description']

0        The adventures of a female reporter in the 1890s.
1        True story of notorious Australian outlaw Ned ...
2        Two men of high rank are both wooing the beaut...
3        The fabled queen of Egypt's affair with Roman ...
4        Loosely adapted from Dante's Divine Comedy and...
                               ...                        
85850    A psychiatric hospital patient pretends to be ...
85851    A middle-aged veterinary surgeon believes his ...
85852                                                  NaN
85853                                                  NaN
85854    Pep, a 13-year-old boy, is in love with a girl...
Name: description, Length: 85855, dtype: object

In [None]:
df.loc[df['description'].str.contains('artificial intelligence', na=False)].shape

(12, 23)
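The `na=False` argument matters because some descriptions are missing (`NaN`). Without it, the result of `.str.contains` can contain missing values for those rows, which cannot be used as a boolean filter. A minimal sketch on a hypothetical three-row series:

```python
import numpy as np
import pandas as pd

# Hypothetical descriptions, one of them missing
desc = pd.Series(['a robot with artificial intelligence', np.nan, 'a love story'])

# na=False treats a missing description as "no match"
mask = desc.str.contains('artificial intelligence', na=False)

matches = desc[mask]  # keeps only the rows where the mask is True
```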

Or, if you know the exact string you are looking for, test for equality:

In [None]:
df.loc[df['title'] == "Fight Club"]

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,writer,production_company,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics,duration (hours)
32487,tt0137523,Fight Club,Fight Club,1999,1999-10-29,Drama,139,"USA, Germany",English,David Fincher,"Chuck Palahniuk, Jim Uhls",Fox 2000 Pictures,"Edward Norton, Brad Pitt, Meat Loaf, Zach Gren...",An insomniac office worker and a devil-may-car...,8.8,1807440,$ 63000000,$ 37030102,$ 101218804,66.0,3758.0,370.0,2.316667


##### 3.3.4.2.2 By numerical value

In [None]:
df[df['votes'] > 1000]

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,writer,production_company,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics,duration (hours)
4,tt0002130,L'Inferno,L'Inferno,1911,1911-03-06,"Adventure, Drama, Fantasy",68,Italy,Italian,"Francesco Bertolini, Adolfo Padovan",Dante Alighieri,Milano Film,"Salvatore Papa, Arturo Pirovano, Giuseppe de L...",Loosely adapted from Dante's Divine Comedy and...,7.0,2237,,,,,31.0,14.0,1.133333
11,tt0002844,Fantômas - À l'ombre de la guillotine,Fantômas - À l'ombre de la guillotine,1913,1913-05-12,"Crime, Drama",54,France,French,Louis Feuillade,"Marcel Allain, Louis Feuillade",Société des Etablissements L. Gaumont,"René Navarre, Edmund Breon, Georges Melchior, ...",Inspector Juve is tasked to investigate and ca...,7.0,1944,,,,,9.0,28.0,0.900000
13,tt0003037,Juve contre Fantômas,Juve contre Fantômas,1913,1913-09-08,"Crime, Drama",61,France,French,Louis Feuillade,"Marcel Allain, Louis Feuillade",Société des Etablissements L. Gaumont,"René Navarre, Edmund Breon, Georges Melchior, ...",In Part Two of Louis Feuillade's 5 1/2-hour ep...,7.0,1349,,,,,8.0,23.0,1.016667
16,tt0003165,Le mort qui tue,Le mort qui tue,1913,1913-11-06,"Crime, Drama, Mystery",90,France,French,Louis Feuillade,"Marcel Allain, Louis Feuillade",Société des Etablissements L. Gaumont,"René Navarre, Edmund Breon, Georges Melchior, ...",After a body disappears from inside the prison...,7.0,1050,,,,,6.0,18.0,1.500000
18,tt0003419,Lo studente di Praga,Der Student von Prag,1913,1913-08-22,"Drama, Fantasy, Horror",85,Germany,"German, English","Paul Wegener, Stellan Rye","Hanns Heinz Ewers, Hanns Heinz Ewers",Deutsche Bioscop GmbH,"Paul Wegener, Grete Berger, Lyda Salmonova, Jo...","Balduin, a student of Prague, leaves his royst...",6.5,1768,,,,,20.0,26.0,1.416667
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85811,tt9860728,Falling Inn Love - Ristrutturazione con amore,Falling Inn Love,2019,2019-08-29,"Comedy, Romance",98,USA,English,Roger Kumble,"Elizabeth Hackett, Hilary Galanoy",,"Christina Milian, Adam Demos, Jeffrey Bowyer-C...",When city girl Gabriela spontaneously enters a...,5.6,14108,,,,,265.0,32.0,1.633333
85817,tt9866700,Paranormal Investigation,Paranormal Investigation,2018,2018-12-01,"Horror, Thriller",92,France,French,Franck Phelizon,,Baril Pictures,"Jose Atuncar, Claudine Bertin, Cedric Henquez,...",When a young man becomes possessed after playi...,3.7,1299,,,,,334.0,11.0,1.533333
85837,tt9894470,VFW,VFW,2019,2020-02-14,"Action, Crime, Horror",92,USA,English,Joe Begos,"Max Brallier, Matthew McArdle",Fangoria,"Stephen Lang, William Sadler, Fred Williamson,...",A group of old war veterans put their lives on...,6.1,4178,,,$ 23101,72.0,83.0,94.0,1.533333
85839,tt9898858,Coffee & Kareem,Coffee & Kareem,2020,2020-04-03,"Action, Comedy",88,USA,English,Michael Dowse,Shane Mack,Pacific Electric Picture Company,"Ed Helms, Taraji P. Henson, Terrence Little Ga...",Twelve-year-old Kareem Manning hires a crimina...,5.1,10627,,,,35.0,388.0,64.0,1.466667


##### Exercise 5: Filtering

* Filter `df` for all the movies that are longer than 2 hours
* Filter `df` for all movies where 'day' is in the title

In [None]:
# Cell for Ex 5
df.loc[df['duration'] > 120].shape

(11166, 23)

In [None]:
df.loc[df['title'].str.contains('day')].shape

(202, 23)
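Note that `.str.contains` is case-sensitive by default and matches substrings, so `'day'` finds "Friday" but not "Day". A sketch on a few hypothetical titles, using the `case` parameter of `.str.contains`:

```python
import pandas as pd

# Hypothetical titles for illustration
titles = pd.Series(['Groundhog Day', "A Hard Day's Night", 'Friday', 'Heat'])

case_sensitive = titles.str.contains('day')         # only lowercase 'day'
any_case = titles.str.contains('day', case=False)   # 'day', 'Day', 'DAY', ...

print(case_sensitive.sum(), any_case.sum())  # 1 3
```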

#### 3.3.4.3 Select, filter, operation

The real power of Pandas comes in its tools for grouping and aggregating data. Here we'll look at *value counts* and the basics of *group-by* operations.

In [None]:
# a basic select, filter, operate procedure would look like:
df[df['country'] == 'USA']['duration'].describe()

count    28511.000000
mean        93.050437
std         18.576873
min         42.000000
25%         84.000000
50%         91.000000
75%        100.000000
max        398.000000
Name: duration, dtype: float64

We can invert the selection with `~`:

In [None]:
df[~(df['country'] == 'USA')]['duration'].describe()

count    57344.000000
mean       103.981410
std         23.459158
min         41.000000
25%         90.000000
50%         99.000000
75%        112.000000
max        808.000000
Name: duration, dtype: float64
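The boolean masks used for filtering can also be stored in variables and combined. A sketch on hypothetical toy data, using `~` (not), `&` (and), and `|` (or):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'country': ['USA', 'France', 'USA', 'Italy'],
    'duration': [90, 130, 150, 100],
})

is_usa = toy['country'] == 'USA'
is_long = toy['duration'] > 120

not_usa = toy[~is_usa]                 # same rows as toy[toy['country'] != 'USA']
usa_and_long = toy[is_usa & is_long]   # both conditions must hold
usa_or_long = toy[is_usa | is_long]    # either condition may hold
```

When the conditions are written inline, each one needs its own parentheses, as in `df[(df['country'] == 'USA') & (df['votes'] > 1000)]`.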

In preparation for grouping the data, let's bin the instances by their duration (we could have chosen any numerical column). For that, we'll use ``pd.cut``, which sorts values into discrete intervals.  The documentation for ``pd.cut`` can be found [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html). This is like a histogram: the range of the data is divided into *bins*, and each value is assigned to the bin it falls in.  In our example, we'll ask for 10 bins and let Pandas divide the range into intervals of equal width.  Let's see it in action.

```
df['duration_group'] = pd.cut(df['duration'], 10)
df.head()
df.dtypes
```

In [None]:
df['duration'].describe()

count    85855.000000
mean       100.351418
std         22.553848
min         41.000000
25%         88.000000
50%         96.000000
75%        108.000000
max        808.000000
Name: duration, dtype: float64

In [None]:
pd.cut(df['duration'], 10)

0        (40.233, 117.7]
1        (40.233, 117.7]
2        (40.233, 117.7]
3        (40.233, 117.7]
4        (40.233, 117.7]
              ...       
85850    (40.233, 117.7]
85851    (40.233, 117.7]
85852     (117.7, 194.4]
85853    (40.233, 117.7]
85854    (40.233, 117.7]
Name: duration, Length: 85855, dtype: category
Categories (10, interval[float64]): [(40.233, 117.7] < (117.7, 194.4] < (194.4, 271.1] <
                                     (271.1, 347.8] ... (501.2, 577.9] < (577.9, 654.6] <
                                     (654.6, 731.3] < (731.3, 808.0]]
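Because the bins have equal width rather than equal counts, a few very long movies stretch the range and most rows land in the first bin. A sketch on hypothetical durations shows the effect, along with the alternative of passing explicit bin edges and labels (the edges and label names here are made up for illustration):

```python
import pandas as pd

# Hypothetical durations in minutes, with one extreme value
durations = pd.Series([40, 60, 80, 100, 400])

# Three equal-width bins spanning min..max (width = (400 - 40) / 3 = 120)
bins = pd.cut(durations, 3)
counts = bins.value_counts(sort=False)

# Explicit edges and labels are an alternative when equal widths fit poorly
labeled = pd.cut(durations, bins=[0, 90, 120, 1000],
                 labels=['short', 'feature', 'long'])
```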

In [None]:
df['duration_group'] = pd.cut(df['duration'], 10)

In [None]:
df.head()

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,writer,production_company,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics,duration (hours),duration_group
0,tt0000009,Miss Jerry,Miss Jerry,1894,1894-10-09,Romance,45,USA,,Alexander Black,Alexander Black,Alexander Black Photoplays,"Blanche Bayliss, William Courtenay, Chauncey D...",The adventures of a female reporter in the 1890s.,5.9,154,,,,,1.0,2.0,0.75,"(40.233, 117.7]"
1,tt0000574,The Story of the Kelly Gang,The Story of the Kelly Gang,1906,1906-12-26,"Biography, Crime, Drama",70,Australia,,Charles Tait,Charles Tait,J. and N. Tait,"Elizabeth Tait, John Tait, Norman Campbell, Be...",True story of notorious Australian outlaw Ned ...,6.1,589,$ 2250,,,,7.0,7.0,1.166667,"(40.233, 117.7]"
2,tt0001892,Den sorte drøm,Den sorte drøm,1911,1911-08-19,Drama,53,"Germany, Denmark",,Urban Gad,"Urban Gad, Gebhard Schätzler-Perasini",Fotorama,"Asta Nielsen, Valdemar Psilander, Gunnar Helse...",Two men of high rank are both wooing the beaut...,5.8,188,,,,,5.0,2.0,0.883333,"(40.233, 117.7]"
3,tt0002101,Cleopatra,Cleopatra,1912,1912-11-13,"Drama, History",100,USA,English,Charles L. Gaskill,Victorien Sardou,Helen Gardner Picture Players,"Helen Gardner, Pearl Sindelar, Miss Fielding, ...",The fabled queen of Egypt's affair with Roman ...,5.2,446,$ 45000,,,,25.0,3.0,1.666667,"(40.233, 117.7]"
4,tt0002130,L'Inferno,L'Inferno,1911,1911-03-06,"Adventure, Drama, Fantasy",68,Italy,Italian,"Francesco Bertolini, Adolfo Padovan",Dante Alighieri,Milano Film,"Salvatore Papa, Arturo Pirovano, Giuseppe de L...",Loosely adapted from Dante's Divine Comedy and...,7.0,2237,,,,,31.0,14.0,1.133333,"(40.233, 117.7]"


In [None]:
df.dtypes

imdb_title_id              object
title                      object
original_title             object
year                       object
date_published             object
genre                      object
duration                    int64
country                    object
language                   object
director                   object
writer                     object
production_company         object
actors                     object
description                object
avg_vote                  float64
votes                       int64
budget                     object
usa_gross_income           object
worlwide_gross_income      object
metascore                 float64
reviews_from_users        float64
reviews_from_critics      float64
duration (hours)          float64
duration_group           category
dtype: object

Pandas includes an array of useful functionality for manipulating and analyzing tabular data.
We'll take a look at two of these here.

``pandas.value_counts`` returns the number of times each unique value occurs in a column.

We can use it, for example, to break down the movies by their duration group that we just created:

```
pd.value_counts(df['duration_group'], sort=False)
```

In [None]:
pd.value_counts(df['duration_group'], sort=False)

(40.233, 117.7]    72368
(117.7, 194.4]     13197
(194.4, 271.1]       228
(271.1, 347.8]        40
(347.8, 424.5]        11
(424.5, 501.2]         4
(501.2, 577.9]         4
(577.9, 654.6]         1
(654.6, 731.3]         1
(731.3, 808.0]         1
Name: duration_group, dtype: int64

What happens if we try this on a variable with many distinct values, such as the raw durations?

```
pd.value_counts(df['duration'])
```

In [None]:
pd.value_counts(df['duration'])

90     5162
95     3194
100    3106
92     2418
93     2414
       ... 
279       1
301       1
345       1
729       1
319       1
Name: duration, Length: 266, dtype: int64
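The same counts are available as a method on the column itself, which is the form used in the exercises below. A sketch on hypothetical durations, also showing the `normalize` and `bins` parameters of `Series.value_counts`:

```python
import pandas as pd

# Hypothetical durations in minutes
durations = pd.Series([90, 90, 95, 100, 90, 131])

exact = durations.value_counts()                     # one row per distinct value
shares = durations.value_counts(normalize=True)      # fractions instead of counts
binned = durations.value_counts(bins=2, sort=False)  # two equal-width bins, then count
```

With many distinct values, the binned form is usually easier to read than one row per value.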

##### Exercise 6: `value_counts, unique, nunique`

We can do a little data exploration with this by seeing how common different values are. Play around with these pandas methods:

* `value_counts()`
* `unique()`
* `nunique()`

Also be sure to use:

* selection
* filtering
* an operation (you are already doing this with the pandas methods above: `value_counts`, `unique`, `nunique`)

Do so with 3 different columns in the dataframe

In [None]:
# Cell for Exercise 6
df[(df['country'] == 'USA') & (df['votes'] > 1000)]['director'].value_counts()

Michael Curtiz                   40
John Ford                        37
Woody Allen                      36
Clint Eastwood                   34
Raoul Walsh                      32
                                 ..
Richard Howard                    1
Curt A. Sindelar                  1
Don Chaffey                       1
Jerry Zucker                      1
Jennifer Flackett, Mark Levin     1
Name: director, Length: 5140, dtype: int64

In [None]:
df[(df['country'] == 'USA') & (df['votes'] > 1000)]['director'].nunique()

5140
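The three methods are closely related, which is why `nunique()` above matches the `Length` reported by `value_counts()`. A sketch on a hypothetical column with one missing entry:

```python
import pandas as pd

# Hypothetical director column with a missing value
directors = pd.Series(['Ford', 'Allen', 'Ford', None, 'Curtiz'])

distinct = directors.unique()      # array of distinct values, missing included
n_distinct = directors.nunique()   # number of distinct values, missing excluded
counts = directors.value_counts()  # occurrences per value, missing excluded
```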

#### 3.3.4.4 Group-by Operation

One of the killer features of the Pandas dataframe is the ability to do group-by operations.
You can visualize the group-by like this (image borrowed from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do))

![image](https://swcarpentry.github.io/r-novice-gapminder/fig/12-plyr-fig1.png)
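The split, apply, combine idea in the figure can be sketched by hand and compared with the one-line `groupby` version. The `toy` dataframe and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: three keys, two rows each
toy = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                    'data': [1, 2, 3, 4, 5, 6]})

# Split: one sub-frame per key. Apply: sum each. Combine: collect into a Series.
manual = pd.Series({key: group['data'].sum()
                    for key, group in toy.groupby('key')})

# groupby does the same three steps in a single expression
grouped = toy.groupby('key')['data'].sum()

print(grouped.to_dict())  # {'A': 5, 'B': 7, 'C': 9}
```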

#### 3.3.4.5 Summary statistics with groupby: `value_counts`,  `count`, `describe`

Let's take this in smaller steps.
Recall our ``duration_group`` column.

```
pd.value_counts(df['duration_group'])
```

In [None]:
pd.value_counts(df['duration_group'])

(40.233, 117.7]    72368
(117.7, 194.4]     13197
(194.4, 271.1]       228
(271.1, 347.8]        40
(347.8, 424.5]        11
(501.2, 577.9]         4
(424.5, 501.2]         4
(731.3, 808.0]         1
(654.6, 731.3]         1
(577.9, 654.6]         1
Name: duration_group, dtype: int64

`groupby` creates *groups* of records that share a value in one or more columns, and then lets us compute a summary for each group.  The group by documentation is [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html).  Let's count how many non-missing values each column has among the records, or rows, that fall into each bin of our duration data.

```
df.groupby(['duration_group']).count()
```

In [None]:
df.groupby(['duration_group']).count()

Unnamed: 0_level_0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,writer,production_company,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics,duration (hours)
duration_group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1
"(40.233, 117.7]",72368,72368,72368,72368,72368,72368,72368,72315,71618,72309,71474,68922,72304,70464,72368,72368,19493,12123,24645,10759,65972,63321,72368
"(117.7, 194.4]",13197,13197,13197,13197,13197,13197,13197,13186,13122,13171,12528,12211,13192,12990,13197,13197,4133,3153,6298,2506,12017,10493,13197
"(194.4, 271.1]",228,228,228,228,228,228,228,228,221,226,221,207,228,226,228,228,70,40,59,28,215,188,228
"(271.1, 347.8]",40,40,40,40,40,40,40,40,40,40,39,39,40,38,40,40,10,5,9,8,37,37,40
"(347.8, 424.5]",11,11,11,11,11,11,11,11,10,11,11,10,11,11,11,11,1,3,3,1,9,10,11
"(424.5, 501.2]",4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,1,0,0,1,4,4,4
"(501.2, 577.9]",4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,1,0,0,0,2,3,4
"(577.9, 654.6]",1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,1
"(654.6, 731.3]",1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,0,1,1,1,1,1,1
"(731.3, 808.0]",1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1


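Now, let's find the mean of each numeric column for each ``duration_group``.  *Notice* what happens to the non-numeric columns: they are left out of the result. Older pandas versions dropped them silently; pandas 2.0 and later raise an error instead, so we pass `numeric_only=True` to be explicit.

```
df.groupby(['duration_group']).mean(numeric_only=True)
```

In [None]:
df.groupby(['duration_group']).mean(numeric_only=True)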

Unnamed: 0_level_0,duration,avg_vote,votes,metascore,reviews_from_users,reviews_from_critics,duration (hours)
duration_group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"(40.233, 117.7]",93.147026,5.786671,6604.507932,54.153267,35.944037,24.517601,1.55245
"(117.7, 194.4]",136.55255,6.488513,25010.089414,63.042298,100.725639,45.365958,2.275876
"(194.4, 271.1]",220.394737,6.997368,30003.302632,76.5,94.711628,29.015957,3.673246
"(271.1, 347.8]",302.975,6.81,4201.275,79.375,19.378378,21.513514,5.049583
"(347.8, 424.5]",384.636364,7.181818,2602.545455,89.0,18.333333,18.1,6.410606
"(424.5, 501.2]",454.0,7.7,2589.0,59.0,19.25,24.5,7.566667
"(501.2, 577.9]",547.5,7.875,206.5,,1.5,8.666667,9.125
"(577.9, 654.6]",580.0,5.8,157.0,,,,9.666667
"(654.6, 731.3]",729.0,7.8,1126.0,87.0,13.0,30.0,12.15
"(731.3, 808.0]",808.0,7.7,473.0,77.0,5.0,23.0,13.466667


You can also select a single column from the grouped data by name and apply other functions to it, such as ``sum``, ``count``, ``std``, and ``describe``.

```
df.groupby(['duration_group'])['metascore'].describe()
```

In [None]:
df.groupby(['duration_group'])['metascore'].describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
duration_group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
"(40.233, 117.7]",10759.0,54.153267,17.655622,1.0,41.0,55.0,67.0,100.0
"(117.7, 194.4]",2506.0,63.042298,16.265137,9.0,52.0,64.0,75.0,100.0
"(194.4, 271.1]",28.0,76.5,20.53272,10.0,69.75,82.0,90.0,100.0
"(271.1, 347.8]",8.0,79.375,12.070478,56.0,72.25,84.5,88.25,90.0
"(347.8, 424.5]",1.0,89.0,,89.0,89.0,89.0,89.0,89.0
"(424.5, 501.2]",1.0,59.0,,59.0,59.0,59.0,59.0,59.0
"(501.2, 577.9]",0.0,,,,,,,
"(577.9, 654.6]",0.0,,,,,,,
"(654.6, 731.3]",1.0,87.0,,87.0,87.0,87.0,87.0,87.0
"(731.3, 808.0]",1.0,77.0,,77.0,77.0,77.0,77.0,77.0


The simplest version of a groupby looks like this, and you can use almost any aggregation function you wish (mean, median, sum, minimum, maximum, standard deviation, count, etc.)

```
<data object>.groupby(<grouping values>).<aggregate>()
```
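As one concrete instance of this template, the sketch below computes a single aggregate and then several at once with `agg`. The data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'group': ['short', 'short', 'long', 'long'],
                    'score': [50.0, 60.0, 70.0, 90.0]})

# <data object>.groupby(<grouping values>).<aggregate>()
medians = toy.groupby('group')['score'].median()

# agg accepts a list of aggregate names and returns one column per aggregate
summary = toy.groupby('group')['score'].agg(['mean', 'min', 'max'])
```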

You can even group by multiple values: for example we can look at the metascore grouped by the ``duration_group`` and ``country``.
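Grouping by several columns produces a result indexed by every observed combination of the keys (a `MultiIndex`). A minimal sketch on hypothetical toy data shows how to read one entry from such a result and how to flatten it back into ordinary columns:

```python
import pandas as pd

# Hypothetical toy data reusing the column names from the movie dataframe
toy = pd.DataFrame({
    'duration_group': ['short', 'short', 'long', 'long'],
    'country': ['USA', 'France', 'USA', 'USA'],
    'metascore': [50.0, 60.0, 70.0, 90.0],
})

means = toy.groupby(['duration_group', 'country'])['metascore'].mean()

usa_long = means.loc[('long', 'USA')]  # select one (group, country) pair
flat = means.reset_index()             # turn the index levels back into columns
```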

In [None]:
df.groupby(['duration_group', 'country'])['metascore'].describe()

Unnamed: 0_level_0,Unnamed: 1_level_0,count,mean,std,min,25%,50%,75%,max
duration_group,country,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
"(40.233, 117.7]","Afghanistan, France",0.0,,,,,,,
"(40.233, 117.7]","Afghanistan, France, Germany, UK",1.0,64.0,,64.0,64.0,64.0,64.0,64.0
"(40.233, 117.7]","Afghanistan, Iran",0.0,,,,,,,
"(40.233, 117.7]","Afghanistan, Ireland, Japan, Iran, Netherlands",1.0,83.0,,83.0,83.0,83.0,83.0,83.0
"(40.233, 117.7]",Albania,0.0,,,,,,,
...,...,...,...,...,...,...,...,...,...
"(501.2, 577.9]","Philippines, Netherlands, Sweden",0.0,,,,,,,
"(501.2, 577.9]",Russia,0.0,,,,,,,
"(577.9, 654.6]",Soviet Union,0.0,,,,,,,
"(654.6, 731.3]",France,1.0,87.0,,87.0,87.0,87.0,87.0,87.0


##### Exercise 7: Group-by

<ol>
<li>Use <code>pd.cut</code> to bin one or more of the dataframe columns.</li>
<li>Use <code>groupby</code> to group by that column (or those columns).</li>
<li>Compute three different statistical summaries, each in a separate cell.</li>
</ol>



In [None]:
# Cell for Exercise 7