# Introduction to Pandas
***
![Pandas_logo](images/pandas.png)
Image source:https://en.wikipedia.org/wiki/Pandas_(software)
***
The most widely used Python library for data science

It has nothing to do with cute bears. Instead it stands for **Pan**el **da**ta - **Pandas**
***
![cute_pandas](images/cute_pandas.jpg)
Image source:https://wallpaper-house.com/wallpaper-id-399850.php



## Why Pandas ?
***
<img src="images/why-pandas.jpg" width="70%"/>



## Features of Pandas
***
<img src="images/pandas-features.jpg" width="70%"/>


# Pandas Data Structures
***
<img src="images/pandas-datastructures.jpg" width="70%"/>



# Pandas Series
***
* Very similar to a NumPy array.

* What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a numeric position.



## How to create a Series?
***
You can convert a list, NumPy array, or dictionary to a Series by calling the `pd.Series()` constructor. Let's see an example of each.

We will create a series from 
- list
- numpy array 
- dictionary

In [1]:
import pandas as pd

In [5]:
my_list = [10, 20, 30, 40]

In [3]:
series = pd.Series(my_list)

In [6]:
series.values

array([10, 20, 30, 40])

In [7]:
#creating a pandas series from a list
import pandas as pd

my_list = [10, 20, 30, 40]
series = pd.Series(my_list)

print(series)
print(series.index)
print(series.values)

0    10
1    20
2    30
3    40
dtype: int64
RangeIndex(start=0, stop=4, step=1)
[10 20 30 40]


In [8]:
# creating a series from numPy Array
import numpy as np
import pandas as pd

index = ['a','b','c', 'd']
arr = np.array([10,20,30,40])

pd.Series(data=arr,index=index)

a    10
b    20
c    30
d    40
dtype: int64

In [9]:
# creating a series from dictionary
import pandas as pd

d = {'a':10, 'b':20, 'c':30, 'd':40}
pd.Series(d)

a    10
b    20
c    30
d    40
dtype: int64


## Using Index in a Series
***
* The key to using a Series is understanding its index.

* Pandas makes use of these index names or numbers by allowing for **fast lookups** of information (works like a hash table or dictionary).


In [13]:
# Custom index
import pandas as pd
ser1 = pd.Series([1,2,3,4,5], index=['USA', 'China','USSR', 'Japan', 'India']) 
ser2 = pd.Series([1,2,5,4,6], index=['USA', 'China','Italy', 'Japan', 'India'])   

# get the value of 'USA'
print(ser1['USA'])

1


In [14]:
ser2

USA      1
China    2
Italy    5
Japan    4
India    6
dtype: int64

In [15]:
print(ser1 + ser2)

China     4.0
India    11.0
Italy     NaN
Japan     8.0
USA       2.0
USSR      NaN
dtype: float64
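
The `NaN` values above appear because `+` aligns the two Series on their labels, and 'Italy' and 'USSR' each exist in only one of them. As a minimal sketch (the shortened Series below are illustrative, not part of the dataset), the `Series.add()` method with `fill_value` treats a missing label as a chosen default instead:

```python
import pandas as pd

# Illustrative Series whose labels only partly overlap.
ser1 = pd.Series([1, 2, 3], index=['USA', 'China', 'USSR'])
ser2 = pd.Series([1, 2, 5], index=['USA', 'China', 'Italy'])

# Plain "+" aligns on labels; a label found in only one Series gives NaN.
plain_sum = ser1 + ser2

# .add() with fill_value=0 treats the missing side as 0 instead.
filled_sum = ser1.add(ser2, fill_value=0)

print(plain_sum)
print(filled_sum)
```

Whether 0 is a sensible default depends on what the numbers mean, so `fill_value` is a choice to make per analysis rather than a general fix.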


# DataFrame Basics

***

We'll talk about
- How to create a DataFrame, the primary data structure in pandas 
- How to find the shape and rank of the created or existing DataFrames
- How to read DataFrames from a file
- What are indexes, and how do they work in the domain of Pandas DataFrames

## What are DataFrames?

***

DataFrames store data in rectangular grids that are easy to overview. Each row holds the measurements or values of one instance, while each column is a vector containing data for one variable. Because of this, the values within a row can be of different types (numeric, character, logical, etc.), while the values within a column normally share one type.
<br/>

Data frames in Python come with the Pandas library, and they are defined as two-dimensional labeled data structures with columns of potentially different types.
<br/>

In general, you could say that the Pandas data frame consists of three main components: the data, the index, and the columns.


**Creating DataFrames manually**

You create a DataFrame with the `pd.DataFrame()` constructor: you pass it the data and, optionally, the index (row labels) and the columns (column labels). If you omit them, pandas numbers the rows and columns starting from 0.

Remember that the data that is contained within the data frame doesn't have to be homogeneous.

In [16]:
import pandas as pd
import numpy as np

df = pd.DataFrame([[1, 2, 3],
                   [3, 4, 5],
                   [5, 6, 7],
                   [7, 8, 9]])
df

Unnamed: 0,0,1,2
0,1,2,3
1,3,4,5
2,5,6,7
3,7,8,9


In [17]:
df = pd.DataFrame([[1, 2, 3], [3, 4, 5], [5, 6, 7], [7, 8, 9]])

print("Shape:", df.shape)
print("Index:", df.index)

df

Shape: (4, 3)
Index: RangeIndex(start=0, stop=4, step=1)


Unnamed: 0,0,1,2
0,1,2,3
1,3,4,5
2,5,6,7
3,7,8,9
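
The grids above hold only integers. To illustrate the point that the data does not have to be homogeneous, here is a small sketch that builds a DataFrame from a dictionary of columns. The column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical readings: each dictionary key becomes a column name.
readings = pd.DataFrame(
    {
        'city': ['Oslo', 'Cairo', 'Lima'],
        'temp_c': [-3.5, 24.0, 18.2],
        'rainy': [True, False, False],
    },
    index=['a', 'b', 'c'],   # optional row labels
)

print(readings.shape)    # (number of rows, number of columns)
print(readings.dtypes)   # one data type per column
```

Each column keeps its own data type, which is what makes a DataFrame different from a plain two-dimensional NumPy array.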


**Understanding the Indexes**

Before you start with adding, deleting and renaming the components of your DataFrame, you first need to know how you can select these elements.

This is where indexes come into play. Just as you use the index page of a book to locate chapters, you can use the `.loc[]` and `.iloc[]` indexers in pandas to access particular rows and columns of your DataFrame.

We will learn about how these functions work and their subtle differences in the next sections.

In [18]:
df2 = pd.DataFrame([[1, 2, 3], [3, 4, 5], [5, 6, 7], [7, 8, 9]],
                  index= ['a','b','c','d'], columns=['x','y','z'])

In [19]:
df2

Unnamed: 0,x,y,z
a,1,2,3
b,3,4,5
c,5,6,7
d,7,8,9


In [20]:
df2 = pd.DataFrame([[1, 2, 3], [3, 4, 5], [5, 6, 7], [7, 8, 9]],
                   index=['a', 'b', 'c', 'd'], columns=['x', 'y', 'z'])

print("Shape:", df2.shape)
print("Index:", df2.index)

df2

Shape: (4, 3)
Index: Index(['a', 'b', 'c', 'd'], dtype='object')


Unnamed: 0,x,y,z
a,1,2,3
b,3,4,5
c,5,6,7
d,7,8,9


# The Weather Dataset: Reading DataFrames from Files
***
The Weather Dataset is a time-series data set with per-hour information about the weather conditions at a particular location. It records Temperature, Dew Point Temperature, Relative Humidity, Wind Speed, Visibility, Pressure, and Conditions.

<img src="images/weather.jpg" alt="Weather" style="width: 200px;"/>

This data is available as a CSV file. We are going to use Pandas DataFrames and analyse this dataset.


In [35]:
# Read the data into a data frame

weather_df = pd.read_csv("data/weather_data.csv")

print("Shape:", weather_df.shape)
print("Index:", weather_df.index)

Shape: (350640, 10)
Index: RangeIndex(start=0, stop=350640, step=1)


In [36]:
weather_df.head()

Unnamed: 0,utc_timestamp,Country,AT_temperature,AT_radiation_direct_horizontal,AT_radiation_diffuse_horizontal,RO_radiation_direct_horizontal,RO_radiation_diffuse_horizontal,SE_temperature,Postal,Extract_Date
0,1980-01-01T00:00:00Z,USA,-3.64,0.0,0.0,0.0,0.0,-3.945,74493,1/1/1995
1,1980-01-01T01:00:00Z,USA,-3.803,0.0,0.0,0.0,0.0,-4.053,74493,1/2/1995
2,1980-01-01T02:00:00Z,USA,-3.969,0.0,0.0,0.0,0.0,-4.129,74493,1/3/1995
3,1980-01-01T03:00:00Z,USA,-4.076,0.0,0.0,0.0,0.0,-4.139,74493,1/4/1995
4,1980-01-01T04:00:00Z,USA,-4.248,0.0,0.0,0.0,0.0,-4.239,74493,1/5/1995


In [45]:
weather_df['utc_timestamp'].head()

0    1980-01-01T00:00:00Z
1    1980-01-01T01:00:00Z
2    1980-01-01T02:00:00Z
3    1980-01-01T03:00:00Z
4    1980-01-01T04:00:00Z
Name: utc_timestamp, dtype: object

**Let's convert the `utc_timestamp` column datatype from `object` to `datetime64` so that we can access the month directly using the attribute `dt.month`**

In [46]:
weather_df['utc_timestamp'] = pd.to_datetime(weather_df['utc_timestamp'])

In [33]:
weather_df['utc_timestamp'].head()

0   1980-01-01 00:00:00+00:00
1   1980-01-01 01:00:00+00:00
2   1980-01-01 02:00:00+00:00
3   1980-01-01 03:00:00+00:00
4   1980-01-01 04:00:00+00:00
Name: utc_timestamp, dtype: datetime64[ns, UTC]
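
Once a column has a datetime type, the `.dt` accessor exposes its components. A minimal sketch, using three made-up timestamps in the same ISO format as the dataset:

```python
import pandas as pd

# Three hypothetical hourly timestamps stored as plain strings.
stamps = pd.Series([
    '1980-01-01T00:00:00Z',
    '1980-02-15T13:00:00Z',
    '1980-12-31T23:00:00Z',
])

# Parse the strings into timezone-aware datetime values.
parsed = pd.to_datetime(stamps)

# The .dt accessor pulls individual components out of each value.
months = parsed.dt.month
hours = parsed.dt.hour

print(months.tolist())
print(hours.tolist())
```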



# How to Analyze DataFrames?
***
The following functions help you understand and explore summaries of your data without having to view the whole DataFrame

## `.info()`
***
Provides a summary of a DataFrame: the number of rows and columns, the data type of each column (as detected by pandas), and the memory usage.

For a more detailed summary of the DataFrame, you can pass the optional arguments `verbose=True` and `show_counts=True` to the `.info()` method to output information for all of the columns. (`show_counts` replaces the older `null_counts` argument, which has been removed from recent pandas versions.)

In [47]:
weather_df.info() # Bring the cursor inside the brackets of info() and hit shift+tab & see what you get.
                  # This will work for any function in Pandas

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350640 entries, 0 to 350639
Data columns (total 10 columns):
 #   Column                           Non-Null Count   Dtype              
---  ------                           --------------   -----              
 0   utc_timestamp                    350640 non-null  datetime64[ns, UTC]
 1   Country                          350640 non-null  object             
 2   AT_temperature                   350640 non-null  float64            
 3   AT_radiation_direct_horizontal   350640 non-null  float64            
 4   AT_radiation_diffuse_horizontal  350640 non-null  float64            
 5   RO_radiation_direct_horizontal   350640 non-null  float64            
 6   RO_radiation_diffuse_horizontal  350640 non-null  float64            
 7   SE_temperature                   350640 non-null  float64            
 8   Postal                           350640 non-null  int64              
 9   Extract_Date                     350640 non-null  object   

## `.head()`
***
It is used to preview a part of a large DataFrame, similar to the Linux `head` command. Displaying only a few rows is quicker and easier to read than rendering the whole DataFrame. It shows the first N rows in the data (by default, N=5).

In [48]:
weather_df.head(5)

Unnamed: 0,utc_timestamp,Country,AT_temperature,AT_radiation_direct_horizontal,AT_radiation_diffuse_horizontal,RO_radiation_direct_horizontal,RO_radiation_diffuse_horizontal,SE_temperature,Postal,Extract_Date
0,1980-01-01 00:00:00+00:00,USA,-3.64,0.0,0.0,0.0,0.0,-3.945,74493,1/1/1995
1,1980-01-01 01:00:00+00:00,USA,-3.803,0.0,0.0,0.0,0.0,-4.053,74493,1/2/1995
2,1980-01-01 02:00:00+00:00,USA,-3.969,0.0,0.0,0.0,0.0,-4.129,74493,1/3/1995
3,1980-01-01 03:00:00+00:00,USA,-4.076,0.0,0.0,0.0,0.0,-4.139,74493,1/4/1995
4,1980-01-01 04:00:00+00:00,USA,-4.248,0.0,0.0,0.0,0.0,-4.239,74493,1/5/1995


## `.index`
***
This attribute provides the `index` of the dataframe.

Indexing identifies data by known labels, which allows intuitive getting and setting of subsets of the data set.

A major advantage of Pandas over NumPy is that each of the columns and rows has a label. Working with column positions is possible, but it can be hard to keep track of which number corresponds to which column.

We can work with labels using the **pandas.DataFrame.loc** indexer, which allows us to index using labels instead of positions.

In [49]:
weather_df.index

RangeIndex(start=0, stop=350640, step=1)

## `.unique()`
***
This method, which belongs to the `Series` object, can be useful when trying to identify unique values in a column.
- Uniques are returned in order of appearance. 
- For long enough sequences it is significantly faster than `numpy.unique`, and it includes NA values

In [51]:
weather_df['AT_temperature']

0        -3.640
1        -3.803
2        -3.969
3        -4.076
4        -4.248
          ...  
350635   -1.386
350636   -1.661
350637   -1.986
350638   -2.184
350639   -2.271
Name: AT_temperature, Length: 350640, dtype: float64

In [52]:
weather_df['AT_temperature'].unique()

array([-3.64 , -3.803, -3.969, ..., 26.798, 25.578, 22.565],
      shape=(42301,))

## `.nunique()`
***
This method belongs to the `Series` object and can be useful when trying to identify the number of unique values in a column. 
- Excludes NA values by default
- Always returns an integer value

In [53]:
weather_df['AT_temperature'].nunique()

42301

## `.value_counts()`
***
This method, which belongs to the `Series` object, can be useful when trying to identify unique values and their counts in a column
- The resulting object will be in descending order so that the first element is the most frequently-occurring element. 
- Excludes NA values by default.

In [54]:
weather_df['AT_temperature'].value_counts()

AT_temperature
 14.562    29
 1.186     28
 14.280    27
-0.608     27
-0.256     27
           ..
 30.304     1
 31.653     1
 31.469     1
 30.710     1
 30.198     1
Name: count, Length: 42301, dtype: int64
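
The three methods above differ in how they treat missing values. The weather column has none, so the sketch below uses a small made-up Series with one `NaN` to show the difference; the `dropna` parameter switches the default behaviour of `.nunique()` and `.value_counts()`.

```python
import numpy as np
import pandas as pd

# Made-up temperatures with one missing value.
temps = pd.Series([1.5, 2.0, 1.5, np.nan, 2.0, 1.5])

unique_values = temps.unique()                      # NaN is included
n_unique = temps.nunique()                          # NaN excluded by default
n_unique_with_na = temps.nunique(dropna=False)      # count NaN as a value

counts = temps.value_counts()                       # NaN excluded by default
counts_with_na = temps.value_counts(dropna=False)   # NaN gets its own row

print(unique_values)
print(n_unique, n_unique_with_na)
print(counts_with_na)
```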


# Data Manipulation : Gets you desired results
***
The true power of the Pandas DataFrame is the ease and flexibility of manipulating data to get your desired results.

Pandas is best at handling tabular data sets comprising different variable types (integer, float, double, etc.). 

In addition, the pandas library covers everything from the most basic tasks, such as loading data, to feature engineering on time series data.

## Selection (Part 1)
***
How do you select particular rows/columns from the DataFrame ?

The DataFrame object supports indexing operations similar to those of the Python `list` class and the Pandas Series object, and adds label-based and boolean selection on top.

Note that when you extract a single row or column, you get a one-dimensional object as output. That is called a pandas Series. The values on the left of its printed output are just labels taken from the dataframe index.

On the other hand, when we extract several rows or columns of a pandas dataframe, we get a two-dimensional DataFrame object. Something to keep in mind for later.

### How to get the Postal column from the "weather_df" dataframe

In [55]:
col = weather_df['Postal']

print(type(col))
col.head()

<class 'pandas.core.series.Series'>


0    74493
1    74493
2    74493
3    74493
4    74493
Name: Postal, dtype: int64

### How to get the Postal and Country columns from the "weather_df" dataframe

In [56]:
two_cols = weather_df[['Postal', 'Country']] # Take a good look at those brackets. There are two sets of them
                                               # to access more than one column.
print(type(two_cols))
two_cols.head()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,Postal,Country
0,74493,USA
1,74493,USA
2,74493,USA
3,74493,USA
4,74493,USA


**Keep in mind: whenever you need to select more than one column, you need to use double square brackets [[]] as in the example above**

## **Get the first 25 rows from the "weather_df" dataframe**
***
**Important**: This slicing would work even if the row index had non-numeric labels, because an integer slice such as `[:25]` selects rows by position, the same way as a list slice

In [57]:
weather_df[:25]

Unnamed: 0,utc_timestamp,Country,AT_temperature,AT_radiation_direct_horizontal,AT_radiation_diffuse_horizontal,RO_radiation_direct_horizontal,RO_radiation_diffuse_horizontal,SE_temperature,Postal,Extract_Date
0,1980-01-01 00:00:00+00:00,USA,-3.64,0.0,0.0,0.0,0.0,-3.945,74493,1/1/1995
1,1980-01-01 01:00:00+00:00,USA,-3.803,0.0,0.0,0.0,0.0,-4.053,74493,1/2/1995
2,1980-01-01 02:00:00+00:00,USA,-3.969,0.0,0.0,0.0,0.0,-4.129,74493,1/3/1995
3,1980-01-01 03:00:00+00:00,USA,-4.076,0.0,0.0,0.0,0.0,-4.139,74493,1/4/1995
4,1980-01-01 04:00:00+00:00,USA,-4.248,0.0,0.0,0.0,0.0,-4.239,74493,1/5/1995
5,1980-01-01 05:00:00+00:00,USA,-4.527,0.0,0.0,0.0005,0.0514,-4.335,74493,1/6/1995
6,1980-01-01 06:00:00+00:00,USA,-4.84,0.0239,0.4413,0.2407,12.0231,-4.384,74493,1/7/1995
7,1980-01-01 07:00:00+00:00,USA,-4.703,4.6844,33.5061,1.0967,42.8109,-4.468,74493,1/8/1995
8,1980-01-01 08:00:00+00:00,USA,-3.835,30.1528,90.8233,2.8222,79.1826,-4.362,74493,1/9/1995
9,1980-01-01 09:00:00+00:00,USA,-2.804,61.2307,126.7209,5.5566,111.5644,-3.875,74493,1/10/1995



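**How to get alternating rows from the "weather_df" dataframe, but only the utc_timestamp and Extract_Date columns**

The slice `0:10:2` takes every second row among the first 10 rows; the slice `:6:2` used further down returns exactly the first 3 alternating rows.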


In [58]:
weather_df[['utc_timestamp','Extract_Date']][0:10:2]

Unnamed: 0,utc_timestamp,Extract_Date
0,1980-01-01 00:00:00+00:00,1/1/1995
2,1980-01-01 02:00:00+00:00,1/3/1995
4,1980-01-01 04:00:00+00:00,1/5/1995
6,1980-01-01 06:00:00+00:00,1/7/1995
8,1980-01-01 08:00:00+00:00,1/9/1995


In [59]:
# solution 1
result1 = weather_df[:6:2][['utc_timestamp','Extract_Date']]

# solution 2
result2 = weather_df[['utc_timestamp','Extract_Date']][:6:2]

# are they the same?
result1 == result2

Unnamed: 0,utc_timestamp,Extract_Date
0,True,True
2,True,True
4,True,True


#### So which of the two solutions should you use?
***
**Answer**: Neither, because both index more than once (chained indexing).
When you use chained indexing, the order and type of the indexing operations partially determine whether the result is a slice into the original object, or a copy of the slice.

Let's analyse (break down) one of the above solutions.

In [60]:
# first indexing
df1 = weather_df[:6:2]

# second indexing
df2 = df1[['utc_timestamp','Extract_Date']]

While both results are correct in this **read-only** case, chained indexing may give unpredictable behaviours when **writing** to a dataframe.

This is because indexing could either return a "view" (of slices of the dataframe), or a copy of the dataframe.
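
A minimal sketch of the usual way around the problem when writing: do the row selection and the column selection in a single `.loc` call. The tiny DataFrame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical wind speeds plus an empty column we want to fill in.
df = pd.DataFrame({'wind': [4, 30, 7, 35], 'flag': ['', '', '', '']})

# Chained form (avoid for writes):
#     df[df['wind'] > 24]['flag'] = 'windy'
# The first indexing step may return a temporary copy, so the assignment
# can end up on that copy and leave df unchanged.

# Single .loc call: rows and column are selected in one step,
# so the assignment is applied to df itself.
df.loc[df['wind'] > 24, 'flag'] = 'windy'

print(df)
```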

## Selection (Part 2)
***
Pandas provides a powerful way to work with both rows and columns together, optionally using their label indices or numeric indices.

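- **`.loc:`**<br/>
Purely label-based indexer for selection by label (but may also be used with a boolean array).<br/>
**Important: If you use slicing in `.loc`, the end label is included in the result.**
<br/><br/>

- **`.iloc:`**<br/>
Purely integer-position based indexer for selection by position (but may also be used with a boolean array). Slices in `.iloc` exclude the end position, like list slices.

Allowed inputs for `.loc` are:
- A single label, e.g. 5 or 'a' (here 5 is interpreted as a label of the index, not as an integer position)
- A list or array of labels, e.g. ['a', 'b', 'c']
- A slice object with labels, e.g. 'a':'f'

`.iloc` accepts the positional equivalents: a single integer, a list or array of integers, or a slice of integers.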


In [61]:
weather_df.loc[0:5, ['utc_timestamp','Extract_Date']]

Unnamed: 0,utc_timestamp,Extract_Date
0,1980-01-01 00:00:00+00:00,1/1/1995
1,1980-01-01 01:00:00+00:00,1/2/1995
2,1980-01-01 02:00:00+00:00,1/3/1995
3,1980-01-01 03:00:00+00:00,1/4/1995
4,1980-01-01 04:00:00+00:00,1/5/1995
5,1980-01-01 05:00:00+00:00,1/6/1995


In [62]:
weather_df[['utc_timestamp','Extract_Date']].iloc[0:5]

Unnamed: 0,utc_timestamp,Extract_Date
0,1980-01-01 00:00:00+00:00,1/1/1995
1,1980-01-01 01:00:00+00:00,1/2/1995
2,1980-01-01 02:00:00+00:00,1/3/1995
3,1980-01-01 03:00:00+00:00,1/4/1995
4,1980-01-01 04:00:00+00:00,1/5/1995
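
With the default `RangeIndex`, labels and positions are the same numbers, which hides the difference between the two indexers. The sketch below uses a small made-up DataFrame with string labels so that the inclusive and exclusive slice ends are easy to see.

```python
import pandas as pd

# Small illustrative frame with string row labels.
df = pd.DataFrame({'x': [1, 3, 5, 7], 'y': [2, 4, 6, 8]},
                  index=['a', 'b', 'c', 'd'])

# .loc slices by label and INCLUDES the end label: rows a, b and c.
by_label = df.loc['a':'c', ['x']]

# .iloc slices by position and EXCLUDES the end position: rows 0 and 1.
by_position = df.iloc[0:2, [0]]

# A single row label plus a single column label returns a scalar.
single_value = df.loc['b', 'y']

print(by_label)
print(by_position)
print(single_value)
```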



## Filtering
***

![Filter](images/filters1.jpg)
Image Source:https://pixabay.com/en/yashica-filter-camera-vintage-711794/
<br/>

A filter is anything that takes in data, processes it, and provides an output

Input Data ⟶ Filter ⟶ Output Data

Filtering rows of a DataFrame is an almost mandatory task for Data Analysis with Python. Given a Data Frame, we may not be interested in the entire dataset but only in specific rows.

### Find all instances when snow was recorded
***
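Whether or not it snowed can be found out using the `Weather` column of the 2012 weather dataset (`data/weather_2012.csv`), which we load first.

Note that this kind of filtering uses a boolean mask computed from the contents of a column. It is different from the `DataFrame.filter()` method, which does not filter a dataframe on its contents: that method is applied to the labels of the index.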

In [68]:
# Read the data into a data frame

weather_df = pd.read_csv("data/weather_2012.csv") 

print("Shape:", weather_df.shape)
print("Index:", weather_df.index)

Shape: (8784, 8)
Index: RangeIndex(start=0, stop=8784, step=1)


In [69]:
weather_df.head()

Unnamed: 0,Date/Time,Temp (C),Dew Point Temp (C),Rel Hum (%),Wind Spd (km/h),Visibility (km),Stn Press (kPa),Weather
0,2012-01-01 00:00:00,-1.8,-3.9,86,4,8.0,101.24,Fog
1,2012-01-01 01:00:00,-1.8,-3.7,87,4,8.0,101.24,Fog
2,2012-01-01 02:00:00,-1.8,-3.4,89,7,4.0,101.26,"Freezing Drizzle,Fog"
3,2012-01-01 03:00:00,-1.5,-3.2,88,6,4.0,101.27,"Freezing Drizzle,Fog"
4,2012-01-01 04:00:00,-1.5,-3.3,88,7,4.8,101.23,Fog


In [70]:
weather_df['Weather'].unique()

array(['Fog', 'Freezing Drizzle,Fog', 'Mostly Cloudy', 'Cloudy', 'Rain',
       'Rain Showers', 'Mainly Clear', 'Snow Showers', 'Snow', 'Clear',
       'Freezing Rain,Fog', 'Freezing Rain', 'Freezing Drizzle',
       'Rain,Snow', 'Moderate Snow', 'Freezing Drizzle,Snow',
       'Freezing Rain,Snow Grains', 'Snow,Blowing Snow', 'Freezing Fog',
       'Haze', 'Rain,Fog', 'Drizzle,Fog', 'Drizzle',
       'Freezing Drizzle,Haze', 'Freezing Rain,Haze', 'Snow,Haze',
       'Snow,Fog', 'Snow,Ice Pellets', 'Rain,Haze', 'Thunderstorms,Rain',
       'Thunderstorms,Rain Showers', 'Thunderstorms,Heavy Rain Showers',
       'Thunderstorms,Rain Showers,Fog', 'Thunderstorms',
       'Thunderstorms,Rain,Fog',
       'Thunderstorms,Moderate Rain Showers,Fog', 'Rain Showers,Fog',
       'Rain Showers,Snow Showers', 'Snow Pellets', 'Rain,Snow,Fog',
       'Moderate Rain,Fog', 'Freezing Rain,Ice Pellets,Fog',
       'Drizzle,Ice Pellets,Fog', 'Drizzle,Snow', 'Rain,Ice Pellets',
       'Drizzle,Snow,Fog', 

In [72]:
snowed_filter = weather_df['Weather'].str.contains('Snow')

In [74]:
snowed_filter

0       False
1       False
2       False
3       False
4       False
        ...  
8779     True
8780     True
8781     True
8782     True
8783     True
Name: Weather, Length: 8784, dtype: bool

In [75]:
weather_df[snowed_filter]

Unnamed: 0,Date/Time,Temp (C),Dew Point Temp (C),Rel Hum (%),Wind Spd (km/h),Visibility (km),Stn Press (kPa),Weather
41,2012-01-02 17:00:00,-2.1,-9.5,57,22,25.0,99.66,Snow Showers
44,2012-01-02 20:00:00,-5.6,-13.4,54,24,25.0,100.07,Snow Showers
45,2012-01-02 21:00:00,-5.8,-12.8,58,26,25.0,100.15,Snow Showers
47,2012-01-02 23:00:00,-7.4,-14.1,59,17,19.3,100.27,Snow Showers
48,2012-01-03 00:00:00,-9.0,-16.0,57,28,25.0,100.35,Snow Showers
...,...,...,...,...,...,...,...,...
8779,2012-12-31 19:00:00,0.1,-2.7,81,30,9.7,100.13,Snow
8780,2012-12-31 20:00:00,0.2,-2.4,83,24,9.7,100.03,Snow
8781,2012-12-31 21:00:00,-0.5,-1.5,93,28,4.8,99.95,Snow
8782,2012-12-31 22:00:00,-0.2,-1.8,89,28,9.7,99.91,Snow


In [76]:
# Basically, we want to keep only the records that have the word "snow" (case insensitive) in the "Weather" column

snowed_filter = weather_df['Weather'].str.lower().str.contains('snow')
weather_df[snowed_filter]

Unnamed: 0,Date/Time,Temp (C),Dew Point Temp (C),Rel Hum (%),Wind Spd (km/h),Visibility (km),Stn Press (kPa),Weather
41,2012-01-02 17:00:00,-2.1,-9.5,57,22,25.0,99.66,Snow Showers
44,2012-01-02 20:00:00,-5.6,-13.4,54,24,25.0,100.07,Snow Showers
45,2012-01-02 21:00:00,-5.8,-12.8,58,26,25.0,100.15,Snow Showers
47,2012-01-02 23:00:00,-7.4,-14.1,59,17,19.3,100.27,Snow Showers
48,2012-01-03 00:00:00,-9.0,-16.0,57,28,25.0,100.35,Snow Showers
...,...,...,...,...,...,...,...,...
8779,2012-12-31 19:00:00,0.1,-2.7,81,30,9.7,100.13,Snow
8780,2012-12-31 20:00:00,0.2,-2.4,83,24,9.7,100.03,Snow
8781,2012-12-31 21:00:00,-0.5,-1.5,93,28,4.8,99.95,Snow
8782,2012-12-31 22:00:00,-0.2,-1.8,89,28,9.7,99.91,Snow


**Find all instances when wind speed was above 24 and visibility was 25**

In [77]:
df = weather_df[(weather_df['Wind Spd (km/h)'] > 24) & (weather_df['Visibility (km)']== 25)]
df.head()

Unnamed: 0,Date/Time,Temp (C),Dew Point Temp (C),Rel Hum (%),Wind Spd (km/h),Visibility (km),Stn Press (kPa),Weather
23,2012-01-01 23:00:00,5.3,2.0,79,30,25.0,99.31,Cloudy
24,2012-01-02 00:00:00,5.2,1.5,77,35,25.0,99.26,Rain Showers
25,2012-01-02 01:00:00,4.6,0.0,72,39,25.0,99.26,Cloudy
26,2012-01-02 02:00:00,3.9,-0.9,71,32,25.0,99.26,Mostly Cloudy
27,2012-01-02 03:00:00,3.7,-1.5,69,33,25.0,99.3,Mostly Cloudy
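
The example above combines two conditions with `&` (and). The same pattern works with `|` (or) and `~` (not), and `.isin()` tests membership in a list of values. A minimal sketch on made-up observations; note that each condition needs its own parentheses.

```python
import pandas as pd

# Hypothetical observations, loosely modelled on the weather columns.
obs = pd.DataFrame({
    'wind': [4, 30, 26, 10],
    'visibility': [8.0, 25.0, 4.0, 25.0],
    'weather': ['Fog', 'Cloudy', 'Snow', 'Clear'],
})

# OR: rows where at least one condition holds.
windy_or_clear_view = obs[(obs['wind'] > 24) | (obs['visibility'] == 25)]

# NOT: invert a boolean mask with ~.
not_windy = obs[~(obs['wind'] > 24)]

# Membership test against a list of allowed values.
fog_or_snow = obs[obs['weather'].isin(['Fog', 'Snow'])]

print(windy_or_clear_view)
print(not_windy)
print(fog_or_snow)
```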


## Summary / Cheatsheet: Selection/Indexing/Filtering
***
This is a handy reminder for what syntax will get what result.

Syntax | Function | Remarks
:--- | :--- | :---
**`df['some_label']`** |  Get the (single) Column referenced by name `some_label` | A **str** is provided
**`df[['label1', 'label2']]`** | Get multiple columns referenced by given names | A **list** is provided 
**`df[start:end:step]`** | Get corresponding rows (same as list slicing) | A **slicing operator**<br/> is provided
**`df[boolean array/df]`** | Get the rows for which the mask is `True` | A **filter object** is provided
**`df.loc[row_sel, col_sel]`** | Select specified rows and columns (by labels) | Slices include the end label
**`df.iloc[row_sel, col_sel]`** | Select specified rows and columns (by integer position) | Slices exclude the end position



## Working with Columns
***
- We will learn how to carry out Series operations on DataFrame Columns
- How to add or update columns within a DataFrame
- How to rename specific columns
- How to delete or drop a column that is no longer required for analysis

### Series Operations
***
A series is a one-dimensional ndarray with axis labels (including time series).

Labels need not be unique but must be a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index. Statistical methods from ndarray have been overridden to automatically exclude missing data (currently represented as NaN).

Operations between Series (`+`, `-`, `/`, `*`, `**`) align values based on their associated index values; the Series need not be the same length. The result index after the operation will be the sorted union of the two indexes.

Add 10 to the values in the column "Wind Spd (km/h)" using the "+" operator

In [78]:
weather_df["Wind Spd (km/h)"].head() * 4 / 8

0    2.0
1    2.0
2    3.5
3    3.0
4    3.5
Name: Wind Spd (km/h), dtype: float64

In [79]:
add_10 = weather_df["Wind Spd (km/h)"] + 10
add_10.head()

0    14
1    14
2    17
3    16
4    17
Name: Wind Spd (km/h), dtype: int64

Multiply the values in the 'Visibility (km)' column by 2 using the asterisk (*) operator

In [80]:
mult_2 = weather_df['Visibility (km)'] * 2
mult_2.head()

0    16.0
1    16.0
2     8.0
3     8.0
4     9.6
Name: Visibility (km), dtype: float64

Add the "Temp (C)" and "Dew Point Temp (C)" columns, storing the result both as a new column and as a Series named "temperature"

This can be done by simply selecting the two columns and using the "+" operator

In [81]:
weather_df['new_temp_col'] = weather_df["Temp (C)"] + weather_df["Dew Point Temp (C)"]

In [82]:
weather_df.head()

Unnamed: 0,Date/Time,Temp (C),Dew Point Temp (C),Rel Hum (%),Wind Spd (km/h),Visibility (km),Stn Press (kPa),Weather,new_temp_col
0,2012-01-01 00:00:00,-1.8,-3.9,86,4,8.0,101.24,Fog,-5.7
1,2012-01-01 01:00:00,-1.8,-3.7,87,4,8.0,101.24,Fog,-5.5
2,2012-01-01 02:00:00,-1.8,-3.4,89,7,4.0,101.26,"Freezing Drizzle,Fog",-5.2
3,2012-01-01 03:00:00,-1.5,-3.2,88,6,4.0,101.27,"Freezing Drizzle,Fog",-4.7
4,2012-01-01 04:00:00,-1.5,-3.3,88,7,4.8,101.23,Fog,-4.8


In [83]:
temperature = weather_df["Temp (C)"] + weather_df["Dew Point Temp (C)"]
temperature.head()

0   -5.7
1   -5.5
2   -5.2
3   -4.7
4   -4.8
dtype: float64

## Apply / Call Functions

## `.apply()`
***

`.apply()` invokes a function on the values of a Series. The function can be a NumPy function that applies to the entire Series or a Python function that only works on single values.

You can pass extra arguments to the function that `apply` calls, either as positional arguments in a tuple given to the `args` parameter, or as additional keyword arguments, which `apply` forwards to the function.

In [84]:
# Applying custom functions

def times2(value):
    return value * 2


In [85]:
weather_df['Visibility (km)'].head()

0    8.0
1    8.0
2    4.0
3    4.0
4    4.8
Name: Visibility (km), dtype: float64

In [86]:
weather_df['Visibility (km)'].apply(times2).head()


0    16.0
1    16.0
2     8.0
3     8.0
4     9.6
Name: Visibility (km), dtype: float64
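
The `times2` function takes only the value itself. As a sketch of passing extra arguments through `.apply()`, the hypothetical `scale` function below takes one extra positional argument (supplied via `args`) and one optional keyword argument (supplied as a keyword to `.apply()`).

```python
import pandas as pd

def scale(value, factor, offset=0):
    """Multiply a single value by factor, then add offset."""
    return value * factor + offset

# The first five visibility values from the dataset, typed in by hand.
visibility_km = pd.Series([8.0, 8.0, 4.0, 4.0, 4.8])

# Extra positional arguments go into the args tuple.
in_meters = visibility_km.apply(scale, args=(1000,))

# Extra keyword arguments are forwarded to the function as well.
shifted = visibility_km.apply(scale, args=(2,), offset=1)

print(in_meters)
print(shifted)
```

For simple arithmetic like this, the vectorised form `visibility_km * 1000` is shorter and usually faster; `.apply()` is most useful when the logic cannot be written as a plain expression.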

## `.describe()`
***

`.describe()` summarizes the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values.

In [87]:
weather_df['Visibility (km)'].describe()

count    8784.000000
mean       27.664447
std        12.622688
min         0.200000
25%        24.100000
50%        25.000000
75%        25.000000
max        48.300000
Name: Visibility (km), dtype: float64

### Adding/Updating Columns

In [88]:
weather_df["Visibility (m)"] = weather_df["Visibility (km)"] * 1000  

In [89]:
visibility_in_meter = weather_df["Visibility (km)"] * 1000  
weather_df["Visibility (m)"] = visibility_in_meter

weather_df.head()

Unnamed: 0,Date/Time,Temp (C),Dew Point Temp (C),Rel Hum (%),Wind Spd (km/h),Visibility (km),Stn Press (kPa),Weather,new_temp_col,Visibility (m)
0,2012-01-01 00:00:00,-1.8,-3.9,86,4,8.0,101.24,Fog,-5.7,8000.0
1,2012-01-01 01:00:00,-1.8,-3.7,87,4,8.0,101.24,Fog,-5.5,8000.0
2,2012-01-01 02:00:00,-1.8,-3.4,89,7,4.0,101.26,"Freezing Drizzle,Fog",-5.2,4000.0
3,2012-01-01 03:00:00,-1.5,-3.2,88,6,4.0,101.27,"Freezing Drizzle,Fog",-4.7,4000.0
4,2012-01-01 04:00:00,-1.5,-3.3,88,7,4.8,101.23,Fog,-4.8,4800.0


### Renaming Columns

## `.rename()`
***

Alters axis labels. For a DataFrame, passing a mapping to the `columns` parameter replaces each existing column name that appears as a key with the corresponding new name; names that are not in the mapping are left unchanged.

The same works for row labels through the `index` parameter.

The rename() method allows you to relabel an axis based on some mapping (a dict or Series) or an arbitrary function.

In [90]:
# Notice the "inplace=True" parameter. This means the renaming is applied to the existing DataFrame itself instead of returning a new one

weather_df.rename(columns={'Visibility (m)': 'Visibility (meters)'}, inplace=True)
weather_df.head()

Unnamed: 0,Date/Time,Temp (C),Dew Point Temp (C),Rel Hum (%),Wind Spd (km/h),Visibility (km),Stn Press (kPa),Weather,new_temp_col,Visibility (meters)
0,2012-01-01 00:00:00,-1.8,-3.9,86,4,8.0,101.24,Fog,-5.7,8000.0
1,2012-01-01 01:00:00,-1.8,-3.7,87,4,8.0,101.24,Fog,-5.5,8000.0
2,2012-01-01 02:00:00,-1.8,-3.4,89,7,4.0,101.26,"Freezing Drizzle,Fog",-5.2,4000.0
3,2012-01-01 03:00:00,-1.5,-3.2,88,6,4.0,101.27,"Freezing Drizzle,Fog",-4.7,4000.0
4,2012-01-01 04:00:00,-1.5,-3.3,88,7,4.8,101.23,Fog,-4.8,4800.0
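
Without `inplace=True`, `.rename()` leaves the original untouched and returns a new DataFrame. The sketch below shows that, and also the "arbitrary function" variant mentioned above, on a small made-up frame.

```python
import pandas as pd

# Small illustrative frame reusing two of the column names from above.
df = pd.DataFrame({'Temp (C)': [-1.8, -1.5], 'Visibility (m)': [8000.0, 4000.0]})

# Mapping: only the listed column is renamed; df itself is unchanged.
renamed = df.rename(columns={'Visibility (m)': 'Visibility (meters)'})

# Function: applied to every column name, here the built-in str.lower.
lowered = df.rename(columns=str.lower)

print(list(df.columns))
print(list(renamed.columns))
print(list(lowered.columns))
```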


### Deleting Columns

## `.drop()`
***

Returns a new object with the requested labels removed from the given axis (`axis=0` for rows, `axis=1` for columns).

Note that Pandas uses zero-based numbering, so with the default index 0 is the first row, 1 is the second row, etc.
Rows are dropped by their index labels, so to drop a range relative to the top or the bottom of the DataFrame you pass the corresponding labels.

- Note: Specifying both labels and index or columns will raise a ValueError.

In [91]:
# Since we have not mentioned inplace=True, it returns a new dataframe.
weather_df.drop(labels=['Visibility (meters)'], axis=1)

Unnamed: 0,Date/Time,Temp (C),Dew Point Temp (C),Rel Hum (%),Wind Spd (km/h),Visibility (km),Stn Press (kPa),Weather,new_temp_col
0,2012-01-01 00:00:00,-1.8,-3.9,86,4,8.0,101.24,Fog,-5.7
1,2012-01-01 01:00:00,-1.8,-3.7,87,4,8.0,101.24,Fog,-5.5
2,2012-01-01 02:00:00,-1.8,-3.4,89,7,4.0,101.26,"Freezing Drizzle,Fog",-5.2
3,2012-01-01 03:00:00,-1.5,-3.2,88,6,4.0,101.27,"Freezing Drizzle,Fog",-4.7
4,2012-01-01 04:00:00,-1.5,-3.3,88,7,4.8,101.23,Fog,-4.8
...,...,...,...,...,...,...,...,...,...
8779,2012-12-31 19:00:00,0.1,-2.7,81,30,9.7,100.13,Snow,-2.6
8780,2012-12-31 20:00:00,0.2,-2.4,83,24,9.7,100.03,Snow,-2.2
8781,2012-12-31 21:00:00,-0.5,-1.5,93,28,4.8,99.95,Snow,-2.0
8782,2012-12-31 22:00:00,-0.2,-1.8,89,28,9.7,99.91,Snow,-2.0
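
As a small sketch on a made-up frame: the `columns=` and `index=` keywords are an alternative to `labels=` plus `axis=`, and reassigning the result is a common way to make the change stick without `inplace=True`.

```python
import pandas as pd

# Illustrative frame with three columns.
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# Equivalent to df.drop(labels=['c'], axis=1); df is not modified.
without_c = df.drop(columns=['c'])

# Drop a row by its index label.
without_first_row = df.drop(index=[0])

# Reassign the result to keep the change.
df = df.drop(columns=['b'])

print(list(without_c.columns))
print(list(without_first_row.index))
print(list(df.columns))
```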




## Sorting
***

## `.sort_values()`
***

Sort by the values along either axis, in a user-specified order. The order is set with the `ascending` parameter: `True` (the default) for ascending, `False` for descending.

In [92]:
sorted_by_temp = weather_df.sort_values('Temp (C)', ascending=False)  # can be inplace as well
sorted_by_temp.head()

Unnamed: 0,Date/Time,Temp (C),Dew Point Temp (C),Rel Hum (%),Wind Spd (km/h),Visibility (km),Stn Press (kPa),Weather,new_temp_col,Visibility (meters)
4143,2012-06-21 15:00:00,33.0,19.0,44,24,24.1,100.2,Mainly Clear,52.0,24100.0
4695,2012-07-14 15:00:00,33.0,16.8,38,22,48.3,101.31,Mainly Clear,49.8,48300.0
4696,2012-07-14 16:00:00,32.9,15.3,35,24,48.3,101.26,Mainly Clear,48.2,48300.0
5199,2012-08-04 15:00:00,32.8,18.8,44,17,24.1,101.39,Clear,51.6,24100.0
4694,2012-07-14 14:00:00,32.7,15.3,35,28,48.3,101.35,Mainly Clear,48.0,48300.0


### Which were the 10 most frequent temperature values and their counts?

In [93]:
sorted_value_counts = weather_df['Temp (C)'].value_counts().sort_values(ascending=False)
sorted_value_counts.iloc[:10]

Temp (C)
16.6    65
1.1     58
0.8     47
1.5     45
19.3    44
21.1    43
2.6     43
0.4     41
1.3     40
14.6    39
Name: count, dtype: int64
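
The counts above rank temperatures by how often they occur. If the goal is instead the rows with the highest values, `DataFrame.nlargest()` is a shortcut for sorting and taking the head, and `sort_values()` also accepts several columns. A sketch on made-up data with hypothetical column names:

```python
import pandas as pd

# Hypothetical temperature and wind readings.
df = pd.DataFrame({
    'temp': [33.0, 21.5, 33.0, 10.2, 28.4],
    'wind': [24, 9, 22, 30, 17],
})

# The three rows with the highest temperature.
hottest = df.nlargest(3, 'temp')

# Sort by temperature (descending), breaking ties by wind (ascending).
ordered = df.sort_values(['temp', 'wind'], ascending=[False, True])

print(hottest)
print(ordered)
```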


# Pivot Tables : Excellent way to Summarize your Data!
***
- A pivot table is a tool that allows you to reorganize and summarize selected columns and rows of data in a dataframe <br/><br/>

- Pivot tables provide an easy way to subset by one column and then apply a calculation like a sum or a mean <br/><br/>

- A pivot table first groups the data and only then applies a calculation

In [95]:
data = {
    'A': ['foo','foo','foo','bar','bar','bar'],
    'B': ['one','one','two','two','one','one'],
    'C': ['x','y','x','y','x','y'],
    'D': [1, 3, 2, 5, 4, 1]
}

df = pd.DataFrame(data)
df

Unnamed: 0,A,B,C,D
0,foo,one,x,1
1,foo,one,y,3
2,foo,two,x,2
3,bar,two,y,5
4,bar,one,x,4
5,bar,one,y,1


In [96]:
pivot_df = df.pivot_table(
                values='D',      # We want to aggregate the values of which column?
                index='A',       # We want to use which column as the new index?
                columns=['C'],   # We want to use the values of which column as the new columns? (optional)
                aggfunc='sum')   # What aggregation function to use? (a string name avoids the FutureWarning raised for np.sum)


pivot_df



C,x,y
A,,
bar,4,6
foo,3,3


In [97]:
# convert it back to a simple index

pivot_df.reset_index()

C,A,x,y
0,bar,4,6
1,foo,3,3
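
Because a pivot table groups first and aggregates second, the same result can usually be built by hand with `groupby` followed by `unstack`. The following is a sketch on a reduced copy of the example data (`df_toy` is a hypothetical name), not a description of how pandas implements `pivot_table` internally:

```python
import pandas as pd

df_toy = pd.DataFrame({
    "A": ["foo", "foo", "foo", "bar", "bar", "bar"],
    "C": ["x", "y", "x", "y", "x", "y"],
    "D": [1, 3, 2, 5, 4, 1],
})

# One step: pivot_table groups by A and C, then sums D
via_pivot = df_toy.pivot_table(values="D", index="A", columns="C", aggfunc="sum")

# Two steps: group and sum, then move the C level from the rows to the columns
via_groupby = df_toy.groupby(["A", "C"])["D"].sum().unstack("C")

print(via_pivot)
print(via_groupby)
```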


### What is the mean temperature recorded by month?

In [98]:
weather_df[['Temp (C)','Date/Time']]

Unnamed: 0,Temp (C),Date/Time
0,-1.8,2012-01-01 00:00:00
1,-1.8,2012-01-01 01:00:00
2,-1.8,2012-01-01 02:00:00
3,-1.5,2012-01-01 03:00:00
4,-1.5,2012-01-01 04:00:00
...,...,...
8779,0.1,2012-12-31 19:00:00
8780,0.2,2012-12-31 20:00:00
8781,-0.5,2012-12-31 21:00:00
8782,-0.2,2012-12-31 22:00:00


In [104]:
# Convert to datetime format
weather_df['Date/Time'] = pd.to_datetime(weather_df['Date/Time'])
weather_df['Month'] = weather_df['Date/Time'].dt.month

weather_df['Date/Time'].dt.month

0        1
1        1
2        1
3        1
4        1
        ..
8779    12
8780    12
8781    12
8782    12
8783    12
Name: Date/Time, Length: 8784, dtype: int32

In [105]:
mean_temperature_df = weather_df.pivot_table(values='Temp (C)', index=weather_df['Date/Time'].dt.month, aggfunc='mean')  # a string name avoids the FutureWarning raised for np.mean
mean_temperature_df # the numbers 1 to 12 denote the respective months from January to December.



Date/Time,Temp (C)
1,-7.371505
2,-4.225
3,3.121237
4,7.009306
5,16.237769
6,20.134028
7,22.790054
8,22.279301
9,16.484444
10,10.954973



# Group By
***
The groupby method allows you to group rows of data together and call aggregate functions that apply to each group as a whole.

A groupby operation involves one or more of the following steps on the original object:
- Splitting the Object
- Applying a function
- Combining the results

In many situations, we split the data into sets and apply some function to each subset. In the apply step, we can perform the following operations:

- Aggregation: computing a summary statistic for each group
- Transformation: performing a group-specific operation that returns a result aligned with the original rows
- Filtration: discarding whole groups that fail some condition
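
The three kinds of apply step can be sketched on a small made-up frame (`sales`, `store`, and `amount` are hypothetical names chosen for illustration):

```python
import pandas as pd

sales = pd.DataFrame({
    "store": ["a", "a", "b", "b", "c"],
    "amount": [10, 20, 5, 7, 100],
})
grouped = sales.groupby("store")["amount"]

# Aggregation: one summary value per group
totals = grouped.sum()

# Transformation: the group mean is broadcast back to every original row
centered = sales["amount"] - grouped.transform("mean")

# Filtration: keep only the rows of groups with more than one record
multi_row = sales.groupby("store").filter(lambda g: len(g) > 1)

print(totals)
print(centered)
print(multi_row)
```

Aggregation shrinks the data to one row per group, transformation keeps the original length, and filtration keeps or drops entire groups.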

In [112]:
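# Exclude the helper 'Month' column so it is not averaged and does not clash with the renamed group key
numeric_cols = weather_df.select_dtypes(include=[np.number]).columns.drop('Month', errors='ignore')
grouped_df = weather_df.groupby(weather_df['Date/Time'].dt.month)[numeric_cols].mean().reset_index()

grouped_df.rename(columns={'Date/Time': 'Month'}, inplace=True)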


# Concat, Merge and Join 
<br/>

***
There are 3 key ways of combining DataFrames together:

- **Concatenation**: Concatenation glues DataFrames together along an axis. Labels on the other axis are aligned, and any label missing from one of the DataFrames is filled with NaN <br/><br/>
- **Merging**: The merge function combines DataFrames on key columns, using logic similar to joining SQL tables<br/><br/>
- **Join**: Join is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame

In [115]:
df1 = pd.DataFrame({
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3'],
    'C': ['C0', 'C1', 'C2', 'C3'],
    'D': ['D0', 'D1', 'D2', 'D3']
}, index=[0, 1, 2, 3])

df2 = pd.DataFrame({
    'A': ['A4', 'A5', 'A6', 'A7'],
    'B': ['B4', 'B5', 'B6', 'B7'],
    'C': ['C4', 'C5', 'C6', 'C7'],
    'D': ['D4', 'D5', 'D6', 'D7']
}, index=[4, 5, 6, 7])

df3 = pd.DataFrame({
    'A': ['A8', 'A9', 'A10', 'A11'],
    'B': ['B8', 'B9', 'B10', 'B11'],
    'C': ['C8', 'C9', 'C10', 'C11'],
    'E': ['D8', 'D9', 'D10', 'D11']
}, index=[8, 9, 10, 11])

In [116]:
df1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


In [117]:
df2

Unnamed: 0,A,B,C,D
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7


In [118]:
df3

Unnamed: 0,A,B,C,E
8,A8,B8,C8,D8
9,A9,B9,C9,D9
10,A10,B10,C10,D10
11,A11,B11,C11,D11


In [119]:
# if you don't specify an axis, it defaults to axis=0, which means it appends to rows
pd.concat([df1, df2, df3])

Unnamed: 0,A,B,C,D,E
0,A0,B0,C0,D0,
1,A1,B1,C1,D1,
2,A2,B2,C2,D2,
3,A3,B3,C3,D3,
4,A4,B4,C4,D4,
5,A5,B5,C5,D5,
6,A6,B6,C6,D6,
7,A7,B7,C7,D7,
8,A8,B8,C8,,D8
9,A9,B9,C9,,D9


In [120]:
# axis=1 means concat along columns

pd.concat([df1, df2, df3], axis=1)

Unnamed: 0,A,B,C,D,A.1,B.1,C.1,D.1,A.2,B.2,C.2,E
0,A0,B0,C0,D0,,,,,,,,
1,A1,B1,C1,D1,,,,,,,,
2,A2,B2,C2,D2,,,,,,,,
3,A3,B3,C3,D3,,,,,,,,
4,,,,,A4,B4,C4,D4,,,,
5,,,,,A5,B5,C5,D5,,,,
6,,,,,A6,B6,C6,D6,,,,
7,,,,,A7,B7,C7,D7,,,,
8,,,,,,,,,A8,B8,C8,D8
9,,,,,,,,,A9,B9,C9,D9
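
The NaN padding above comes from the default `join='outer'` behaviour of `pd.concat`. A small sketch with two made-up frames (`top` and `bottom` are hypothetical) shows how `join='inner'` and `ignore_index=True` change the result:

```python
import pandas as pd

top = pd.DataFrame({"A": ["A0", "A1"], "D": ["D0", "D1"]})
bottom = pd.DataFrame({"A": ["A8", "A9"], "E": ["E8", "E9"]})

# Default join='outer': union of the columns, gaps filled with NaN
outer = pd.concat([top, bottom])

# join='inner': keep only the columns present in every frame;
# ignore_index=True renumbers the rows from 0
inner = pd.concat([top, bottom], join="inner", ignore_index=True)

print(outer)
print(inner)
```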


## Join
***
Join combines two DataFrames that may have different row indices, matching rows on the index.

The `join` method supports left, right, inner, and outer joins
- `how`: {'left', 'right', 'outer', 'inner', 'cross'}, default 'left'. Outer gives the union of the two indices and inner gives their intersection.

In [121]:
# Join
left_df = pd.DataFrame({
    'A': ['A0', 'A1', 'A2'],
    'B': ['B0', 'B1', 'B2']
}, index=['K0', 'K1', 'K2']) 

right_df = pd.DataFrame({
    'C': ['C0', 'C2', 'C3'],
    'D': ['D0', 'D2', 'D3']
}, index=['K0', 'K2', 'K3'])

In [122]:
left_df

Unnamed: 0,A,B
K0,A0,B0
K1,A1,B1
K2,A2,B2


In [123]:
right_df

Unnamed: 0,C,D
K0,C0,D0
K2,C2,D2
K3,C3,D3


In [124]:
left_df.join(right_df, how='outer')

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K1,A1,B1,,
K2,A2,B2,C2,D2
K3,,,C3,D3


In [125]:
left_df.join(right_df, how='inner')

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K2,A2,B2,C2,D2
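
When `how` is omitted, `join` keeps every row of the calling DataFrame. The sketch below uses reduced, hypothetical copies of the frames above (`left_small`, `right_small`) and also shows the same operation written with `pd.merge` on both indexes:

```python
import pandas as pd

left_small = pd.DataFrame({"A": ["A0", "A1", "A2"]}, index=["K0", "K1", "K2"])
right_small = pd.DataFrame({"C": ["C0", "C2", "C3"]}, index=["K0", "K2", "K3"])

# No `how` argument: a left join on the index
default_join = left_small.join(right_small)

# The same result expressed with merge, matching on both indexes
via_merge = pd.merge(left_small, right_small, how="left",
                     left_index=True, right_index=True)

print(default_join)
print(via_merge)
```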


## Merge

You will often work with several DataFrames at once.

The merge function combines them into a single DataFrame by matching rows on one or more key columns.

In [126]:
# Merging on multiple keys
left = pd.DataFrame({
    'key1': ['K0', 'K0', 'K1', 'K2'],
    'key2': ['K0', 'K1', 'K0', 'K1'],
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3']
})
    
right = pd.DataFrame({
    'key1': ['K0', 'K1', 'K1', 'K2'],
    'key2': ['K0', 'K0', 'K0', 'K0'],
    'C': ['C0', 'C1', 'C2', 'C3'],
    'D': ['D0', 'D1', 'D2', 'D3']
})


In [127]:
left

Unnamed: 0,key1,key2,A,B
0,K0,K0,A0,B0
1,K0,K1,A1,B1
2,K1,K0,A2,B2
3,K2,K1,A3,B3


In [128]:
right

Unnamed: 0,key1,key2,C,D
0,K0,K0,C0,D0
1,K1,K0,C1,D1
2,K1,K0,C2,D2
3,K2,K0,C3,D3


In [129]:
# other options for `how` are 'inner' (the default), 'right', 'outer' and 'cross'
pd.merge(left, right, how='left', on=['key1', 'key2'])

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K0,K1,A1,B1,,
2,K1,K0,A2,B2,C1,D1
3,K1,K0,A2,B2,C2,D2
4,K2,K1,A3,B3,,
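
To see where each row of a merge came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. A minimal sketch with a single key and made-up frames (`orders_like` and `lookup_like` are hypothetical names):

```python
import pandas as pd

orders_like = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
lookup_like = pd.DataFrame({"key": ["K0", "K2", "K3"], "C": ["C0", "C2", "C3"]})

# how='outer' keeps keys from both sides; _merge records the source of each row
merged = pd.merge(orders_like, lookup_like, how="outer", on="key", indicator=True)

print(merged)
```

Each row is labelled `both`, `left_only`, or `right_only`, which makes unmatched keys easy to spot.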


# Further Reading
***
- Pandas documentation: http://pandas.pydata.org/
- 10 minutes to pandas: https://pandas.pydata.org/pandas-docs/stable/10min.html
- Cookbook: Useful Pandas Recipes: https://pandas.pydata.org/pandas-docs/stable/cookbook.html
- Pandas and Python Top 10: http://manishamde.github.io/blog/2013/03/07/pandas-and-python-top-10/
- Intro to Pandas Data Structures: http://www.gregreda.com/2013/10/26/