# Using Pandas to Transform Data

## Objectives: 

-  

## Agenda

- 

Some great resources:

- https://www.dataschool.io/
- https://realpython.com/fast-flexible-pandas/
- https://chrisalbon.com/#python
- https://pandas.pydata.org/pandas-docs/version/0.15/tutorials.html
- https://medium.com/dunder-data/pandas-tutorials/home

## Pandas 

In [None]:
import pandas as pd
import numpy as np

### Create a dataframe from a csv file

In [None]:
movies = pd.read_csv('IMDB-Movie-Data.csv', index_col=0)
movies

In [None]:
movies.shape

In [None]:
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_rows', 100)

In [None]:
movies

## What is subset selection?
Before we start doing subset selection, it might be good to define what it is. Subset selection is simply selecting particular rows and columns of data from a DataFrame (or Series). This could mean selecting all the rows and some of the columns, some of the rows and all of the columns, or some of each of the rows and columns. 

**Indexers**

There are many ways to select subsets of data, but in this article we will only cover the usage of the square brackets ([]), .loc and .iloc. Collectively, they are called the indexers. These are by far the most common ways to select data.

**What’s the difference between indexing and selecting subsets of data?**

The documentation uses the term indexing frequently. It is essentially a one-word way of saying ‘subset selection’. I prefer the term subset selection because it is more descriptive of what is actually happening. Indexing is also the term used in the official Python documentation.


**A term for just those square brackets**

The term **indexing operator** refers to the square brackets following an object. The .loc and .iloc indexers also use the indexing operator to make selections. I will use the phrase *just the indexing operator* to refer to df[], which distinguishes it from df.loc[] and df.iloc[].

### Indexing Operator

Selecting one column as a Series

In [None]:
movies['Title']

Selecting one column as a DataFrame

In [None]:
movies[['Title']]

Selecting multiple columns

In [None]:
movies[['Title', 'Year', 'Rating']]

In [None]:
# Using the indexing operator with a slice to select rows
movies[3:6]

### Summary: Indexing Operator
- Its primary purpose is to select columns by the column names
- Select a single column as a Series by passing the column name directly to it: df['col_name']
- Select multiple columns as a DataFrame by passing a list to it: df[['col_name1', 'col_name2']]
- You actually can select rows with it, but this is confusing and not used often.
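The points above can be checked on a small frame. This is a minimal sketch: `toy` and its values are hypothetical stand-ins for the movies data, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the movies DataFrame
toy = pd.DataFrame(
    {
        "Title": ["A", "B", "C", "D"],
        "Year": [2010, 2012, 2014, 2016],
        "Rating": [7.1, 8.2, 6.5, 9.0],
    }
)

one_col = toy["Title"]              # a single label returns a Series
one_col_df = toy[["Title"]]         # a one-item list returns a DataFrame
two_cols = toy[["Title", "Year"]]   # a list of labels returns a DataFrame
rows = toy[1:3]                     # a slice selects rows by position

print(type(one_col).__name__, type(one_col_df).__name__)
print(rows["Title"].tolist())
```

The only difference between `toy["Title"]` and `toy[["Title"]]` is the extra pair of brackets, which is why the returned type is worth checking when a later step expects a DataFrame.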

### .loc
The .loc indexer selects data in a different way than just the indexing operator. It can select subsets of rows or columns. It can also simultaneously select subsets of rows and columns. Most importantly, it only selects data by the LABEL of the rows and columns.

In [None]:
movies_title = movies.set_index('Title')

In [None]:
movies_title.head()

In [None]:
movies_title.loc['Sing']



In [None]:
movies_title.loc[['Prometheus','Sing']]

In [None]:
movies_title.loc['Prometheus':'Sing']


**Selecting rows and columns simultaneously with .loc**

Unlike just the indexing operator, it is possible to select rows and columns simultaneously with .loc. You do it by separating your row and column selections by a comma. It will look something like this:

`df.loc[row_selection, column_selection]`

In [None]:
movies.loc[1:3, ['Title','Genre', 'Year']]

In [None]:
movies_title.loc['Prometheus':'Sing', ['Genre', 'Year']]

In [None]:
movies_title.loc[['Prometheus','Sing'], ['Genre', 'Year']]

### Summary of .loc
- Only uses labels
- Can select rows and columns simultaneously
- Selection can be a single label, a list of labels or a slice of labels
- Put a comma between row and column selections
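One detail worth seeing in action is that a label slice in .loc includes its end label, unlike ordinary Python slicing. The sketch below uses a hypothetical `toy` frame with made-up titles as the index.

```python
import pandas as pd

# Hypothetical frame indexed by title, similar in spirit to movies_title
toy = pd.DataFrame(
    {
        "Genre": ["Action", "Drama", "Comedy", "Horror"],
        "Year": [2010, 2012, 2014, 2016],
    },
    index=["A", "B", "C", "D"],
)

single = toy.loc["B"]                       # one label returns a Series
listed = toy.loc[["A", "D"], ["Year"]]      # lists of row and column labels
sliced = toy.loc["B":"C", "Genre":"Year"]   # label slices include both ends

print(sliced)
```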

### Getting started with .iloc
The .iloc indexer is very similar to .loc but only uses integer locations to make its selections. The name .iloc stands for integer location, which should help you remember what it does.

In [None]:
movies.iloc[3]

In [None]:
movies.iloc[3:6]

In [None]:
movies_title.iloc[:5]

In [None]:
movies_title.iloc[3:4, :3]
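The difference between .loc and .iloc is easiest to see when the index labels are integers that do not start at 0, as with an index that starts at 1. The following sketch assumes a hypothetical `toy` frame built that way.

```python
import pandas as pd

# Hypothetical frame whose labels start at 1 while positions start at 0
toy = pd.DataFrame(
    {"Title": ["A", "B", "C", "D"], "Year": [2010, 2012, 2014, 2016]},
    index=[1, 2, 3, 4],
)

by_label = toy.loc[1:3]       # labels 1, 2 and 3: the end label is included
by_position = toy.iloc[1:3]   # positions 1 and 2: the end position is excluded
cell = toy.iloc[0, 1]         # first row, second column

print(by_label["Title"].tolist())
print(by_position["Title"].tolist())
```

The same slice `1:3` returns three rows with .loc and two different rows with .iloc, so it pays to be explicit about which indexer you mean.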

## Conditional selections


Conditional selections pick rows based on their values rather than on their labels or positions. For example, what if we want to filter our movies DataFrame to show only films directed by Ridley Scott or films with a rating greater than or equal to 8.0?


In [None]:
data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                  'k2': [1, 1, 2, 3, 3, 4, 4]})
data

In [None]:
data[[True, False, True, False, True, False, True]]

In [None]:
data.duplicated()

In [None]:
# method that returns a boolean Series indicating whether each row 
# is a duplicate or not
data[data.duplicated()]


In [None]:
condition = (movies['Director'] == "Ridley Scott")

condition

We want to filter out all movies not directed by Ridley Scott, in other words, we don’t want the False films. To return the rows where that condition is True we have to pass this operation into the DataFrame:

In [None]:
movies[condition]

In [None]:
movies[movies['Director'] == "Ridley Scott"]


**Find how many movies were directed by Christopher Nolan.**


In [None]:
#put your code here
____[____ == _____]

Let's look at conditional selections using numerical values by filtering the DataFrame by ratings:



In [None]:
movies[movies['Rating'] >= 8.8]


We can make some richer conditionals by using logical operators: 
- `|`    for "or"  
- `&`    for "and"
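Each comparison needs its own parentheses, because `|` and `&` bind more tightly than `==` or `>=` in Python. A minimal sketch on a hypothetical `toy` frame (the directors and ratings are made up):

```python
import pandas as pd

# Hypothetical data: director names and ratings are invented for illustration
toy = pd.DataFrame(
    {"Director": ["X", "Y", "X", "Z"], "Rating": [8.5, 7.0, 6.0, 9.1]}
)

is_x = toy["Director"] == "X"    # boolean Series, one value per row
is_high = toy["Rating"] >= 8.0   # boolean Series, one value per row

either = toy[is_x | is_high]     # rows where at least one condition holds
both = toy[is_x & is_high]       # rows where both conditions hold

print(either.index.tolist(), both.index.tolist())
```

Naming the masks first, as done here, is one way to avoid the parentheses problem and keep long filters readable.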


Let's filter the DataFrame to show only movies by Christopher Nolan OR Ridley Scott:

In [None]:
movies[(movies['Director'] == 'Christopher Nolan') | (movies['Director'] == 'Ridley Scott')]


**What is the average revenue of all of the movies by Christopher Nolan that have a rating of 8.7 or better?**

In [None]:
# your code here

____ = movies[(____) & (_____)]

____[_____]._____()

The earlier OR filter on two directors can be written more concisely with the `isin()` method:

In [None]:
movies[movies['Director'].isin(['Christopher Nolan', 'Ridley Scott'])].head()


Using `~` flips your booleans and allows you to find the inverse of your query.  

In [None]:
movies[~movies['Director'].isin(['Christopher Nolan', 'Ridley Scott'])].head()
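A mask and its negation split the rows into two groups that do not overlap. The sketch below illustrates this on a hypothetical `toy` frame with made-up director names.

```python
import pandas as pd

# Hypothetical data for illustration only
toy = pd.DataFrame({"Director": ["X", "Y", "X", "Z"]})

mask = toy["Director"].isin(["X", "Z"])  # True where the value is in the list
selected = toy[mask]                     # rows matching the list
rest = toy[~mask]                        # every other row

print(len(selected), len(rest))
```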


## Applied Question: 

Which group of movies has the higher average revenue: those with a rating above 8, or those with at least 300,000 votes?

In [None]:
# Subset the dataframe to find movies with a rating above 8.


In [None]:
# Find the average revenue of that group.


In [None]:
# Subset the dataframe to find movies with at least 300,000 votes.


In [None]:
# Find the average revenue of that group.


## Data transformation

### Removing duplicates

Duplicate rows may be found in a DataFrame for any number of reasons. Here is an example, using the `data` DataFrame created earlier:


In [None]:

data.drop_duplicates()

If I call `data` again, why are there still duplicates?

In [None]:
data

In [None]:
data.drop_duplicates(inplace=True)
data
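By default `drop_duplicates()` returns a new DataFrame and leaves the original untouched, which is why the duplicates reappeared. Reassigning the result is a common alternative to `inplace=True`. A small sketch with a hypothetical `toy` frame:

```python
import pandas as pd

# Hypothetical data containing one duplicated row
toy = pd.DataFrame({"k1": ["one", "one", "two"], "k2": [1, 1, 2]})

deduped = toy.drop_duplicates()   # returns a new DataFrame
n_before = len(toy)               # toy itself still has all of its rows

toy = toy.drop_duplicates()       # reassigning keeps the deduplicated result

print(n_before, len(toy))
```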

In [None]:
data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                  'k2': [1, 1, 2, 3, 3, 4, 4]})

# Add another column to help us see which rows were dropped
data['v1'] = range(7)

data

In [None]:

data.drop_duplicates(['k1'])

In [None]:

data.drop_duplicates(['k1'], keep='last')

In [None]:
myData = data.drop_duplicates(['k1'], keep='last')

## Create a new column

In [None]:
movies['new_col'] = 0

In [None]:
movies.head()

In [None]:
movies['power'] = movies['Rating'] * np.log(movies['Votes'])

In [None]:
movies.head()
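Assigning to `df['name']` changes the DataFrame in place. If you would rather keep the original unchanged, `DataFrame.assign()` returns a copy with the new column added. The sketch below uses a hypothetical `toy` frame. Note that `np.log` is the natural logarithm.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings and vote counts for illustration
toy = pd.DataFrame({"Rating": [8.0, 6.0], "Votes": [1000, 10]})

# assign() returns a new DataFrame; toy itself does not gain the column
with_power = toy.assign(power=toy["Rating"] * np.log(toy["Votes"]))

# Bracket assignment modifies toy in place; the scalar is repeated on every row
toy["new_col"] = 0

print(with_power.columns.tolist())
print(toy.columns.tolist())
```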

## Conditional Transformations

You want to create a new column that shows the revenue for a movie if the movie is longer than 90 minutes and shorter than 190 minutes.

Given what you know about Python, how would you go about doing this?

Write pseudo-code with a partner to come up with the workflow. 

#### Looping with .itertuples() and .iterrows()
What other approaches can you take? Pandas has made the `for i in range(len(df))` pattern largely unnecessary by providing the `DataFrame.itertuples()` and `DataFrame.iterrows()` methods. Both are generator methods that yield one row at a time.

`.itertuples()` yields a namedtuple for each row, with the row’s index value as the first element of the tuple. A namedtuple is a data structure from Python’s collections module that behaves like a Python tuple but has fields accessible by attribute lookup.

`.iterrows()` yields pairs (tuples) of (index, Series) for each row in the DataFrame.
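To make the two row shapes concrete, here is a minimal sketch on a hypothetical `toy` frame. The column is named `Runtime` rather than `Runtime (Minutes)` because `.itertuples()` renames columns that are not valid Python identifiers to positional names.

```python
import pandas as pd

# Hypothetical frame; the index labels 10 and 20 are arbitrary
toy = pd.DataFrame({"Title": ["A", "B"], "Runtime": [95, 120]}, index=[10, 20])

# itertuples(): each row is a namedtuple; the index label is available as row.Index
tuple_rows = [(row.Index, row.Title, row.Runtime) for row in toy.itertuples()]

# iterrows(): each item is an (index label, Series) pair; fields are read by key
series_rows = [(idx, row["Title"], row["Runtime"]) for idx, row in toy.iterrows()]

print(tuple_rows)
```

Both loops visit the same values; only the way you reach each field differs.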

While .itertuples() tends to be a bit faster, let’s stay in Pandas and use .iterrows() in this example, because some readers might not have run across namedtuple. Let’s see what this achieves:

In [None]:
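# NOTE: `timeit` (a timing decorator) and `apply_time` (a helper that maps a
# runtime in minutes to a value) are not defined in this notebook. They are
# assumed to be defined before this cell is run.
@timeit(repeat=3, number=100)
def apply_transformation_iterrows(df):
    revenue_list = []
    for index, row in df.iterrows():
        time = row['Runtime (Minutes)']

        # Apply the helper to this row's runtime and collect the result
        revenue = apply_time(time)
        revenue_list.append(revenue)
    df['rev_features'] = revenue_list

apply_transformation_iterrows(movies)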

### Pandas’ .apply()
You can further improve this operation using the .apply() method instead of .iterrows(). Pandas’ .apply() method takes functions (callables) and applies them along an axis of a DataFrame (all rows, or all columns). In this example, a lambda function passes each row's runtime into apply_time():

In [None]:
@timeit(repeat=3, number=100)
def apply_transformation_withapply(df):
    df['rev_features_a'] = df.apply(
        lambda row: apply_time(
            row['Runtime (Minutes)']),
        axis=1)

apply_transformation_withapply(movies)
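For a condition this simple, a boolean mask can replace the row-by-row function entirely. The sketch below is one possible approach, not the notebook's own method. It assumes the intended rule is a runtime strictly between 90 and 190 minutes, and it uses a hypothetical `toy` frame with an assumed `Revenue (Millions)` column.

```python
import pandas as pd

# Hypothetical data; the 'Revenue (Millions)' column name is an assumption
toy = pd.DataFrame(
    {
        "Runtime (Minutes)": [85, 120, 200],
        "Revenue (Millions)": [10.0, 50.0, 30.0],
    }
)

runtime = toy["Runtime (Minutes)"]
in_range = (runtime > 90) & (runtime < 190)   # True for runtimes inside the window

# where() keeps the revenue where the mask is True and puts NaN elsewhere
toy["rev_features_v"] = toy["Revenue (Millions)"].where(in_range)

print(toy)
```

Because the comparison and `where()` operate on whole columns at once, no explicit Python loop is needed.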