# Analyzing our Netflix activity

If you're a Netflix customer, you can view, download, or delete the usage data Netflix keeps about you, which it uses to recommend new shows based on your viewing history.

Here we have just a small sample of such a history.

The goal is to measure how addicted we are to *The Office*.

## Data prep

### Load and have a look at the data

In [None]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [None]:
data = pd.read_csv('ViewingActivity-sample.csv')

In [None]:
data.head()

In [None]:
data.shape

### Issues

The `Start Time` column is in UTC, while the user is in New York (Eastern US), so we'll have to convert it to their time zone.

The `Title` column holds both the name of the show and the watched episode, so we'll need a way to keep only the rows for *The Office*.

### Dropping unnecessary columns (optional)

For the purpose of this exercise we'll only be needing the `Start Time`, `Duration` and `Title` columns, so we can get rid of the rest to save memory.

In [None]:
data = data.drop(['Profile Name', 'Attributes', 'Supplemental Video Type', 'Device Type', 'Bookmark', 'Latest Bookmark', 'Country'], axis=1)
data.head()

### Renaming columns for easier access (optional)

Constantly typing capital letters and spaces is tedious, so let's rename the columns.

In [None]:
data.columns = [col.replace(' ', '_').lower() for col in data]
data.head()

### Issue 1 - converting strings to datetime and timedelta

We have to work with dates and durations to solve this one. Are the `start_time` and `duration` columns already stored as date and time types?

In [None]:
data.dtypes

They are not. So here's what we'll have to do:

1. Convert `start_time` to datetime (a date and time format pandas can understand and perform calculations with)
2. Convert `start_time` from UTC to the eastern US timezone
3. Convert `duration` to timedelta (a time duration format pandas can understand and perform calculations with)

#### #1

In [None]:
data['start_time'] = pd.to_datetime(data['start_time'], utc=True)
data.dtypes

#### #2

We'll be using the `tz_convert()` method here.

Here we call `tz_convert()` on a DatetimeIndex, so we first set our `start_time` column as the index using `set_index()`, then perform the conversion on `data.index`. Afterwards we use `reset_index()` to turn it back into a regular column.

In [None]:
data = data.set_index('start_time')

# convert from UTC timezone to eastern time
data.index = data.index.tz_convert('US/Eastern')

# reset the index so that start_time becomes a column again
data = data.reset_index()

data.head()
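
As an aside, a datetime column can usually be converted without the index round trip by going through the `.dt` accessor. The sketch below assumes two made-up rows (the `sample` dataframe is hypothetical, not taken from the CSV) and uses `America/New_York`, the canonical IANA name for the zone that `US/Eastern` points to.

```python
import pandas as pd

# Hypothetical rows for illustration; the notebook reads the real ones from the CSV.
sample = pd.DataFrame({
    'start_time': ['2020-01-15 03:30:00', '2020-07-04 23:05:00'],
    'title': ['The Office (U.S.): Season 1: Pilot (Episode 1)', 'Some Other Show'],
})

# Parse the strings as timezone-aware UTC timestamps.
sample['start_time'] = pd.to_datetime(sample['start_time'], utc=True)

# Convert the column in place through the .dt accessor; no set_index/reset_index needed.
sample['start_time'] = sample['start_time'].dt.tz_convert('America/New_York')

print(sample['start_time'])
```

Note that the January row moves back to the previous calendar day, while the July row stays on the same day because daylight saving time shifts the offset by an hour.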

#### #3

In [None]:
data['duration'] = pd.to_timedelta(data['duration'])
data.dtypes

### Issue 2 - filtering strings by substring

There are many ways we could approach filtering *The Office* views. For our purposes here, though, we're going to create a new dataframe called `office` and populate it only with rows where the `title` column contains *The Office (U.S.)*.

Hint: we can do this using `str.contains()`. Passing `regex=False` makes it match the parentheses and dots in the title literally.

In [None]:
office = data[data['title'].str.contains('The Office (U.S.)', regex=False)]

office.head()
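
To see why `regex=False` matters here, the sketch below compares a literal match with a regular-expression match on a few made-up titles (the `titles` series is hypothetical). Read as a regex, the parentheses form a group and each dot matches any character, so the unescaped pattern misses the real title and can match unrelated text.

```python
import re
import pandas as pd

# Hypothetical titles for illustration only.
titles = pd.Series([
    'The Office (U.S.): Season 2: The Dundies (Episode 1)',
    'The Office UxSx (fan edit)',
    'Parks and Recreation: Season 1: Pilot (Episode 1)',
])

needle = 'The Office (U.S.)'

# Literal substring match, as used in the notebook.
literal = titles.str.contains(needle, regex=False)

# Equivalent regex match once the special characters are escaped.
escaped = titles.str.contains(re.escape(needle), regex=True)

# What the unescaped pattern would match if it were treated as a regex.
unescaped_hits = [bool(re.search(needle, t)) for t in titles]

print(literal.tolist())
print(unescaped_hits)
```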

### Filtering out short durations using timedelta

As you might have noticed, the dataset contains some very short durations. That's mainly because watching a preview also counts as a view.

So let's filter out all the views that have a duration of less than a minute.

In [None]:
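# keep views of at least one minute; .copy() makes `office` an independent
# dataframe so we can safely add columns to it later
office = office[office['duration'] >= pd.Timedelta(minutes=1)].copy()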

office.head()
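
Whether a view of exactly one minute survives the filter depends on the comparison operator. A minimal sketch on made-up durations (the `views` dataframe is hypothetical) shows the difference between `>` and `>=` against an explicit `pd.Timedelta` threshold:

```python
import pandas as pd

# Hypothetical views for illustration only.
views = pd.DataFrame({
    'title': ['preview', 'exactly one minute', 'full episode'],
    'duration': pd.to_timedelta(['00:00:12', '00:01:00', '00:21:45']),
})

one_minute = pd.Timedelta(minutes=1)

# Strictly longer than a minute: the boundary row is dropped.
strictly_longer = views[views['duration'] > one_minute]

# At least a minute: the boundary row is kept.
at_least = views[views['duration'] >= one_minute]

print(strictly_longer['title'].tolist())
print(at_least['title'].tolist())
```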

## Analyzing the Data

### How much time have I spent watching *The Office*?

In [None]:
print(office['duration'].sum())
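
The printed total is a `Timedelta`, which reads as days, hours, minutes, and seconds. If a single number is easier to reason about, `total_seconds()` can be divided down to hours. A minimal sketch on made-up durations (not the real data):

```python
import pandas as pd

# Hypothetical durations for illustration only.
durations = pd.to_timedelta(pd.Series(['00:21:45', '00:22:15', '01:16:00']))

# Summing a timedelta column yields a single Timedelta.
total = durations.sum()

# Express the total as a plain number of hours.
hours = total.total_seconds() / 3600

print(total)
print(f'{hours:.1f} hours')
```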

### When do I watch *The Office*?

Let's answer this question in two different ways:

1. On which days of the week have I watched the most *Office* episodes?
2. During which hours of the day do I most often start *Office* episodes?

We'll start with some prep work that makes these tasks more straightforward: creating new columns for `weekday` and `hour`.

In [None]:
weekdays = office['start_time'].dt.day_name()
office['weekday'] = weekdays

hours = office['start_time'].dt.hour
office['hour'] = hours

office.head()

#### #1

In [None]:
# set our categorical and define the order so the days are plotted Monday-Sunday
office['weekday'] = pd.Categorical(office['weekday'], 
    categories=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'],
    ordered=True)

office.dtypes

In [None]:
office_by_day = office['weekday'].value_counts()

office_by_day

In [None]:
office_by_day = office_by_day.sort_index()
# show all seven days (head() would cut the output off after five)
office_by_day

In [None]:
office_by_day.plot(kind='bar', figsize=(10,6), title='Office Episodes Watched by Week Day', rot=0);
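
The effect of the ordered categorical can be checked on a tiny made-up series: without it, `sort_index()` orders the day names alphabetically, and days with no views are missing from the counts. The `days` values below are hypothetical.

```python
import pandas as pd

order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Hypothetical weekday values for illustration only.
days = pd.Series(['Sunday', 'Monday', 'Sunday', 'Friday'])

# Plain strings: the index sorts alphabetically and unobserved days are absent.
plain_counts = days.value_counts().sort_index()

# Ordered categorical: the index follows Monday-Sunday and every day appears.
ordered_days = pd.Series(pd.Categorical(days, categories=order, ordered=True))
ordered_counts = ordered_days.value_counts().sort_index()

print(plain_counts.index.tolist())
print(ordered_counts)
```

Days without any views show up with a count of zero, which keeps the bar chart's x-axis complete.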

#### #2

In [None]:
# set our categorical and define the order so the hours are plotted 0-23
office['hour'] = pd.Categorical(office['hour'], categories=list(range(0,24)), ordered=True)

office.dtypes

In [None]:
office_by_hour = office['hour'].value_counts()

office_by_hour.head()

In [None]:
office_by_hour = office_by_hour.sort_index()

office_by_hour.head()

In [None]:
office_by_hour.plot(kind='bar', figsize=(10,6), title='Office Episodes Watched by Hour', rot=0);