![SheLovesData](https://shelovesdata.com/wp-content/uploads/2018/06/she-loves-data-wide@2xv2.png)

# Introduction to Python Workshop - Day 2

***
Overview
    1. Library imports
    2. Pandas Series
    3. Pandas DataFrames
    
***

![round2](https://tenor.com/view/lets-go-round2-war-paint-pretty-gif-16173723.gif)

To start our script, we will import the libraries we need using import statements. All the libraries in this demo come out of the box with the [Anaconda](https://docs.anaconda.com/anaconda/packages/py3.7_win-64/) distribution. 

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

**Task**:

Run `import pandas as pd` below: 

Now type `pd.` and press `Tab`. You will see a dropdown that lists the operations available. Remember, Python is **case-sensitive**.

*** 

## Series

A **Series** is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).

In [None]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])

In [None]:
s

### Check Type
We can do many of the same things with a `pd.Series` that we did with lists before. We can check the **type**:
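
As a minimal sketch of the type check, the cell below rebuilds the same Series and inspects it. Printing `s.dtype` as well is an addition for illustration: it shows the type of the values stored inside the Series, as opposed to the type of the object itself.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

# type() reports the class of the object itself (a pandas Series)
print(type(s))

# dtype reports the type of the values stored inside the Series;
# the np.nan forces the integers to be stored as floats
print(s.dtype)
```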

### Subsetting
We can pull the first value (don't forget that zero-indexing!)

We can pull a **range of values**:
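
A short sketch of both kinds of subsetting, assuming the default integer index that pandas assigns when none is given. The variable names `first` and `first_three` are only for illustration, and `.iloc` is used for the range so that the slice is clearly position-based.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

# First value: with the default index, the label 0 is also position 0
first = s[0]
print(first)

# A range of values by position; like lists, the end position is excluded
first_three = s.iloc[0:3]
print(first_three)
```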

**Task**:

Print the second element in the series

### Operations
We can also run many of the same **operations** as we did with lists:
    
#### Arithmetic operators:
<table><thead>
<tr>
<th style="text-align: center">Operator</th>
<th>What it means</th>
</tr>
</thead><tbody>
<tr>
<td style="text-align: center">+</td>
<td>Addition</td>
</tr>
<tr>
<td style="text-align: center">-</td>
<td>Subtraction</td>
</tr>
<tr>
<td style="text-align: center">*</td>
<td>Multiplication</td>
</tr>
<tr>
<td style="text-align: center">/</td>
<td>Division</td>
</tr>
<tr>
<td style="text-align: center">**</td>
<td>Exponentiation</td>
</tr>
</tbody></table>
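
As a rough illustration of the operators in the table, the sketch below applies a few of them to the Series. Each operation is applied to every element at once, and the missing value stays missing. The variable names are hypothetical.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

plus_ten = s + 10   # addition, element by element
doubled = s * 2     # multiplication
squared = s ** 2    # exponentiation

# NaN propagates through arithmetic: position 3 is still NaN
print(plus_ten)
print(doubled)
print(squared)
```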

### Indexing

An additional component of a Series is the **index**, which can hold any labels you like.

In [None]:
s.rename({0:'date', 1:'apple', 2:'orange', 3:'mango' , 4:'pear' , 5:'cucumber' }, inplace=True)

I can now subset on this new index.

In [None]:
s['date']

I can also pull out more than one element from the Series. Notice the syntax has two brackets! If you don't include two brackets, you will get an error.
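
A minimal sketch of the double-bracket syntax, assuming the fruit labels set above. The Series is rebuilt here with the labels passed directly through `index=` so the cell runs on its own; the outer brackets do the subsetting and the inner brackets are an ordinary Python list of labels.

```python
import numpy as np
import pandas as pd

labels = ['date', 'apple', 'orange', 'mango', 'pear', 'cucumber']
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=labels)

# Pass a list of labels inside the subsetting brackets
subset = s[['date', 'apple']]
print(subset)
```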

**Task**:

Print out the number of cucumbers

We can also subset based on a condition. Let's say we wanted to pull all the fruits and veggies that I have very few of: 
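
One way to sketch this is with a boolean condition inside the brackets. The cutoff of 4 for "very few" is an assumption chosen for illustration, and the Series is rebuilt with the fruit labels so the cell runs on its own.

```python
import numpy as np
import pandas as pd

labels = ['date', 'apple', 'orange', 'mango', 'pear', 'cucumber']
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=labels)

# s < 4 gives one True/False per element; only the True rows are kept.
# A comparison with NaN is False, so 'mango' is left out.
few = s[s < 4]
print(few)
```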

I can also **name** the Series. You'll notice that a new label appears at the bottom of the printed Series: `Name: counts`

In [None]:
s = s.rename("counts")
s

**Task**: 
1. Take a look at [this](https://pandas.pydata.org/docs/user_guide/dsintro.html#dsintro) documentation and make your own series. 
2. Check the type 
3. Print the 3rd element
4. Apply an operation of your choice to the Series
5. Change the index to colors
6. Subset the Series

### Advanced Subsetting

If you wanted to subset the Series for a random subset: 

In [None]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])

In [None]:
np.random.seed(seed=3)

In [None]:
rand_index = np.random.randint(0, len(s), 3)
rand_index
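
One possible way to use these random positions is to pass them to `.iloc`, which selects by position. This is a sketch rather than the only approach: `np.random.randint` can draw the same position more than once, so the example also shows `Series.sample`, which by default picks distinct rows. The names `random_subset` and `sampled` are hypothetical.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

np.random.seed(seed=3)
rand_index = np.random.randint(0, len(s), 3)

# Select the rows at the randomly drawn positions (repeats are possible)
random_subset = s.iloc[rand_index]
print(random_subset)

# Alternative: let pandas draw 3 distinct rows; random_state makes it repeatable
sampled = s.sample(n=3, random_state=3)
print(sampled)
```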

### Handling Missing Values
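
A minimal sketch of two common first steps with missing values: counting them and dropping them. It uses the same Series as above; the variable names are only for illustration. Replacing missing values instead of dropping them is done with `fillna()`.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

# isna() marks each element True/False; summing counts the True values
n_missing = s.isna().sum()
print(n_missing)

# dropna() returns a new Series without the missing values; s is unchanged
s_clean = s.dropna()
print(s_clean)
```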

### Task
Fill the missing values with your favorite number and reassign to s

***

## Dataframes

A **DataFrame** is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:

- Dict of 1D ndarrays, lists, dicts, or Series
- 2-D numpy.ndarray
- Structured or record ndarray
- A Series
- Another DataFrame

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.

If axis labels are not passed, they will be constructed from the input data based on common sense rules.
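
To make the index behaviour concrete, here is a small sketch that builds a DataFrame from a dict of Series, first without and then with an explicit `index`. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}

# No index passed: the row labels are the union of the Series indexes,
# and "one" gets NaN for row "d" because it has no value there
df = pd.DataFrame(d)
print(df)

# Index passed: only the requested rows are kept, so row "c" is discarded
df_sub = pd.DataFrame(d, index=["d", "b", "a"])
print(df_sub)
```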

We will be getting familiar with dataframes using the YouTube Dataset found [here](https://www.kaggle.com/datasnaek/youtube-new).

***

### Questions to ask yourself at this point..

After looking at the dataset, what question would we be interested in answering? Maybe: which category is the most popular? 

To answer this question, what would we have to do?

***

![](https://media.giphy.com/media/d1E1YlkOTe4IfdNC/giphy.gif)

### Easiest way to make a DataFrame--Reading in a csv file

A few things to make sure of when reading data into the notebook:
1. Where the data is located relative to the notebook
2. The file extension 
3. The encoding
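
The three points above can be sketched with `pd.read_csv`. So that the cell runs anywhere, it first writes a tiny stand-in file: the file name `sample_videos.csv` and its contents are hypothetical, not the real YouTube data. With the real dataset you would pass the path to the downloaded file instead.

```python
import os
import tempfile

import pandas as pd

# Write a small stand-in CSV (hypothetical file name and values)
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "sample_videos.csv")
with open(csv_path, "w", encoding="utf-8") as f:
    f.write("video_id,category_id,views\n")
    f.write("abc123,10,1500\n")
    f.write("def456,24,320\n")

# 1. the path says where the file lives, 2. the .csv extension tells us
# to use read_csv, 3. the encoding must match how the file was saved
data = pd.read_csv(csv_path, encoding="utf-8")
print(data.head())
```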

Before running any analysis, there are several things you will want to check. First and foremost, did the data read in correctly? 

You can check this by looking at the `head()` of the data

In [None]:
data.head() # <--- you can put a value in this function and look at top X rows. The default is 5.

![wow](https://media.giphy.com/media/sjDV6YTbw8tig/giphy.gif)

You can call `describe()` on the data which will give you descriptive statistics of all **numerical** variables.

In [None]:
data.describe()

In [None]:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html
# data.describe(include='all')

You can look at the shape of the dataset, which will give you the number of rows and the number of columns.

![](https://media.geeksforgeeks.org/wp-content/uploads/finallpandas.png)

To get the number of **rows**:

In [None]:
data.shape[0]

**Task**:

- Print the number of columns in the dataframe

**Challenge**:

- Can you think of another way to get number of rows?
- Can you think of another way to get the number of columns?

We can also look at the datatypes for all the variables in the dataset using `dtypes`:

In [None]:
data.dtypes

### Checking Null Values

A very important step when working with a dataset is checking for missing values. If you skip this step, you may end up with erroneous results. Handling missing values and investigating the dataset properly before analysis is VERY IMPORTANT. I'm sure you've heard this before, but..

Garbage in, garbage out
![d](https://media.giphy.com/media/3oEduNF7DlpxgcHVJe/giphy.gif)

In [None]:
data.isnull()
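
`isnull()` returns a True/False value for every cell, which is hard to read on a large dataset. A common follow-up, sketched here on a small made-up DataFrame (the name `toy` and its values are hypothetical), is to sum those values to get the number of missing entries per column.

```python
import numpy as np
import pandas as pd

# Small made-up example with one missing value in each of two columns
toy = pd.DataFrame({
    "video_id": ["a", "b", "c"],
    "views": [100.0, np.nan, 250.0],
    "description": ["first", "second", None],
})

# True counts as 1 and False as 0, so the sum is the missing count per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```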

In [None]:
data.category_id.unique()

### Task 
How many unique categories are there?

**Task**:

Let's pull in the category labels using the filepath below and store in a dataframe called `categories`.

***

## Merging

In order to get the proper labels into our dataset, we will need to **merge** these two dataframes. There are a few different ways we can do that. 

![](https://www.practicalecommerce.com/wp-content/uploads/2019/07/Data-join-570x421.jpg)

**Task**:

Before we merge, let's remind ourselves what the data looks like and what we will be joining on and what join we want.

Print out the first 5 rows of `data` and `categories`.

A few things we need before we can join: 
1. **Names** of the join key(s) need to be the same across datasets
2. **Datatypes** of the join key(s) need to be the same across datasets
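
The two requirements above can be sketched on toy data. Everything here is an assumption for illustration: the column names `id` and `title` in `categories`, the values, and the choice of a left join. The real files may be laid out differently, so check them with `head()` and `dtypes` first.

```python
import pandas as pd

# Toy stand-ins for the two dataframes (made-up values)
data = pd.DataFrame({
    "video_id": ["a", "b", "c"],
    "category_id": [10, 24, 10],           # integers
})
categories = pd.DataFrame({
    "id": ["10", "24"],                    # strings, and a different name
    "title": ["Music", "Entertainment"],
})

# 1. Make the key names match, and give the label column a clearer name
categories = categories.rename(columns={"id": "category_id", "title": "category"})

# 2. Make the key datatypes match
categories["category_id"] = categories["category_id"].astype(int)

# A left join keeps every row of data and adds the matching label
us_youtube = data.merge(categories, on="category_id", how="left")
print(us_youtube)
```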

In [None]:
us_youtube.category.value_counts()

In [None]:
len(us_youtube.video_id.unique())

What is going on here? Why are there 40k+ rows but only 6k unique video ids? 

![no](https://miro.medium.com/proxy/0*kVETqwtYsQ8BFJDc.gif)

Let's understand what's going on by looking at one video id..

In [None]:
data[data.video_id=='1ZAPwfrtAFY']

***

## Data Aggregations using Groupby

Group DataFrame using a mapper or by a Series of columns.

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
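
As a small sketch of split, apply, and combine, the example below groups a made-up DataFrame by category and counts the distinct video ids in each group. The name `toy` and the values are hypothetical.

```python
import pandas as pd

# Made-up data: video "a" appears twice, like a video trending on two days
toy = pd.DataFrame({
    "category": ["Music", "Music", "Music", "Comedy"],
    "video_id": ["a", "a", "b", "c"],
})

# Split the rows by category, apply nunique() to video_id in each group,
# and combine the results into one Series indexed by category
unique_videos = toy.groupby("category")["video_id"].nunique()
print(unique_videos)
```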


See [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)

![](https://www.datasciencemadesimple.com/wp-content/uploads/2020/05/Generic-Groupby-mean-1.png)

In [None]:
grouped = us_youtube.groupby(['category']).agg({'video_id':'nunique'}).reset_index().sort_values('video_id', ascending=False)
grouped

In [None]:
fig = plt.figure(figsize=(15,10))
sns.barplot(x='category', y='video_id', data=grouped)  # recent seaborn versions require x and y as keywords
plt.xticks(rotation=90)

***

## Project

Pick another country and compare the results with that of the US.

Instructions:
- Import the dataset
- Check for missing values
- Look at descriptive statistics
- Merge with categories
- Pivot to find which category has the most videos uploaded