# Pandas -- Data Analysis Library
- Used extensively by data scientists and data analysts to perform EDA (Exploratory Data Analysis).
- EDA is about getting comfortable enough with the data that you can answer any question raised by your stakeholders or clients.

In [2]:
import numpy as np
import pandas as pd

# Data Containers

## Series

Series: A one-dimensional, array-like structure with an index. It can be visualized as **a single-column dataset**, and it can hold data of many types, such as integer, string, and float.

A Series can be created from several kinds of data input: a list, an ndarray, a dict, or a scalar. The sections below look at list, ndarray, and dict in detail.

### List

A list is a basic Python data structure that can act as an input to create a Pandas Series. A list can hold values of multiple data types. So, if a dataset is available as a list, pass the list as input to create a Series.

In [None]:
print (list('abcdef'))

In [None]:
# Pass list as an argument

first_series = pd.Series(list('abcdef'))
print(first_series)

***The output shows the index, the data values, and the data type.***

***We did not supply an index, so Pandas assigned a default integer index (0, 1, 2, ...) automatically.***
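The default index can be replaced with explicit labels. The following is a minimal sketch; the names `default_series` and `labeled_series` and the labels `x`, `y`, `z` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Without an index argument, pandas assigns a default integer index 0..n-1.
default_series = pd.Series(list('abc'))

# Passing index= attaches explicit labels instead (labels are illustrative).
labeled_series = pd.Series(list('abc'), index=['x', 'y', 'z'])

print(default_series.index.tolist())  # the default positions
print(labeled_series['y'])            # look up a value by its label
```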

### ndarray
    
An ndarray can be used as an input to create a Pandas Series. An ndarray is recommended wherever the dataset is number-centric and requires complex numerical computing, although, as the example below shows, it can hold strings as well.

In [None]:
# ndarray for countries
np_countries = np.array(['Algeria','Angola','Argentina','Australia','Austria','Bahamas','Bangladesh','Belarus','Belgium',
                      'Bhutan','Brazil','Bulgaria','Cambodia','Cameroon','Chile','China','Colombia','Cyprus','Denmark'])
np_countries


In [None]:
# Pass ndarray as an argument

s_countries = pd.Series(np_countries)
print (s_countries)

### dict
A Pandas Series can also be created from a dictionary: the keys become the index labels and the values become the data, which is convenient for indexing or reindexing a dataset during data wrangling. Use a dict whenever the dataset is structured as key-value pairs.

In [None]:
dictionary = {"A" : 20, "B" : 35, 'C': 100}
print (dictionary)

In [None]:
# Pass dictionary as an argument

series = pd.Series(dictionary)
print(series) 
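A short sketch of how the dict keys drive the index. The variable names `scores` and `reordered` and the extra label `"D"` are assumptions made for illustration only.

```python
import pandas as pd

scores = {"A": 20, "B": 35, "C": 100}

# The dict keys become the index labels of the Series.
series = pd.Series(scores)
print(series["B"])

# With a dict input, index= selects and orders the keys;
# a label that is absent from the dict ("D") is filled with NaN.
reordered = pd.Series(scores, index=["C", "A", "D"])
print(reordered)
```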

### Vectorized operations

Vectorized operations apply to a whole Series at once, so you can add two or more Series without writing a loop. The operation is aligned on the index labels of the data elements, not on their positions.

The first example adds two Series, ‘first_vector_series’ and ‘second_vector_series’, which share the same index labels in the same order.

In [None]:
first_vector_series = pd.Series([1,2,3,4], index = ['a','b','c','d']) 
second_vector_series = pd.Series([10,20,30,40], index = ['a','b','c','d'])

print (first_vector_series)
print ()
print (second_vector_series)

In [None]:
print (first_vector_series + second_vector_series)

Let’s **shuffle indices** and see what happens. For the second vector series, we reorder the index labels to c, a, d, and b. When we add the two series, each value is added to the value carrying the same label, so the output differs from the previous one: a data element is bound to its index label, not to its position.

In [None]:
first_vector_series = pd.Series([1,2,3,4], index = ['a','b','c','d']) 
second_vector_series = pd.Series([10,20,30,40], index = ['c','a','d','b'])

print (first_vector_series)
print ()
print (second_vector_series)

In [None]:
print (first_vector_series + second_vector_series)

***Wherever the index labels don't match, no addition takes place and the result holds NaN (Not a Number).***

In [None]:
first_vector_series = pd.Series([1,2,3,4], index = ['a','b','c','d']) 
second_vector_series = pd.Series([10.0,20,30,40], index = ['a','b','e','f'])

print (first_vector_series)
print ()
print (second_vector_series)

In [None]:
print (first_vector_series + second_vector_series)
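If the NaN values are not wanted, one option is `Series.add` with `fill_value`, which substitutes a value for the side that lacks a label. This is a sketch of one possible approach, not something the original notebook uses; the shortened names `first`, `second`, and `total` are illustrative.

```python
import pandas as pd

first = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
second = pd.Series([10.0, 20, 30, 40], index=['a', 'b', 'e', 'f'])

# The + operator leaves NaN wherever a label exists in only one Series.
print(first + second)

# Series.add with fill_value=0 treats the missing side as 0 instead.
total = first.add(second, fill_value=0)
print(total)
```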

## DataFrames

The core data abstraction in Pandas is the DataFrame.

A DataFrame is a two-dimensional labeled data structure with columns of potentially different data types.

A DataFrame looks like a spreadsheet or a SQL table, with rows and columns.

A DataFrame can be built from several kinds of data input, and we’ll go through them in detail.

**Tabular data that you load or initialize using Pandas is represented as a DataFrame**

- To create a DataFrame, you can use either of the following two approaches:
 1. Create a DF from a collection object
 2. Create a DF by loading a file
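The second approach can be sketched with `pd.read_csv`. To keep the example self-contained, an in-memory string stands in for a file on disk; the CSV contents and the name `emp_from_csv` are assumptions made for illustration.

```python
import io
import pandas as pd

# Stand-in for a file on disk: an in-memory CSV (contents are illustrative).
csv_text = "eid,ename,esal\n1,Prashant,1000\n2,Arun,2000\n"

# pd.read_csv accepts a file path or a file-like object;
# by default the first row supplies the column names.
emp_from_csv = pd.read_csv(io.StringIO(csv_text))
print(emp_from_csv)
```

With a real file, the path would be passed in place of the `io.StringIO` object.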

### List

In [None]:
list1 = [[1,'Prashant',1000], [2,'Arun',2000]]
list1

In [None]:
empDataFrameFromList = pd.DataFrame(list1)
empDataFrameFromList

In [None]:
# DataFrame data is represented using Row and Column indexes
# You can replace Column indexes with column names

empDataFrameFromList.columns = ["eid", 'ename', 'esal'] # set the column names
print (empDataFrameFromList)
print ()

display(empDataFrameFromList)

empDataFrameFromList

In [None]:
empDataFrameFromList.columns # get the column names
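Column names can also be supplied when the DataFrame is created, rather than assigned afterwards. A minimal sketch, where the name `emp_df` is an illustrative assumption:

```python
import pandas as pd

list1 = [[1, 'Prashant', 1000], [2, 'Arun', 2000]]

# columns= names the columns at construction time.
emp_df = pd.DataFrame(list1, columns=['eid', 'ename', 'esal'])
print(emp_df.columns.tolist())
```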

# In-class Assignment

# 1


Use the list below to create the DF "empDataFrameFromList2" and set the column names to 'empid', 'ename', 'esal'.

list2 = [[1,'Prashant',5000], [2,'Arun',8000], [3,'Aman',9899]]

In [3]:
list2 = [[1,'Prashant',5000], [2,'Arun',8000], [3,'Aman',9899]]
empDataFrameFromList2 = pd.DataFrame(list2)
empDataFrameFromList2

|  | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1 | Prashant | 5000 |
| 1 | 2 | Arun | 8000 |
| 2 | 3 | Aman | 9899 |


In [4]:
empDataFrameFromList2.columns = ["empid", "ename", "esal"] # providing col names to cols

display(empDataFrameFromList2)

|  | empid | ename | esal |
|---|---|---|---|
| 0 | 1 | Prashant | 5000 |
| 1 | 2 | Arun | 8000 |
| 2 | 3 | Aman | 9899 |


### dict

A Pandas DataFrame can also be created from a ***dictionary of lists***: each key becomes a column name and each list supplies that column's values.

In this example, we will create a dataset related to the Summer Olympics.

First, import the Pandas library. Then, declare a dict ‘olympic_data’ whose keys ‘HostCity’, ‘Year’, and ‘No. of Participating Countries’ each map to a list of data elements.

Next, pass this dict to ‘pd.DataFrame’ to create a basic DataFrame.

As you can observe in the output, the result is a tabular representation of the data with rows and columns. Note that the row index is generated automatically: when we display the DataFrame ‘df_olympic_data’, the output shows every row with its corresponding index.

In [None]:
olympic_data = {'HostCity':['London', 'Beijing', 'Athens', 'Sydney', 'Atlanta'], 
                'Year': [2012, 2008, 2004, 2000, 1996],
                'No. of Participating Countries': [205, 205, 201, 200, 197]}

print (type(olympic_data))
print ()
olympic_data

In [None]:
# dictionary of list as an argument to pd.DataFrame

df_olympic_data = pd.DataFrame(olympic_data)

df_olympic_data

### Series

A Series can also be an input to a DataFrame.

Let’s create two series first. The first series, ‘olympic_series_participation’, holds the number of countries participating in the given year. The second series, ‘olympic_series_countries’, holds the cities that hosted the Olympics in that year. Both use the year as the index.

Now, create a DataFrame ‘df_olympic_series’ by passing both series in a dict. The dict keys become the column names, and the shared index (the year) becomes the row index.

In [None]:
olympic_series_participation = pd.Series([205,205,201,200,197], index = [2012,2008,2004,2000,1996])
olympic_series_countries = pd.Series(['London', 'Beijing', 'Athens', 'Sydney', 'Atlanta'], index = [2012,2008,2004,2000,1996])

In [None]:
print (olympic_series_participation)
print ()
print (olympic_series_countries)

In [None]:
dict_series = {'No. of Participating Countries': olympic_series_participation, 
                                  'HostCity': olympic_series_countries}
dict_series

In [None]:
# dictionary of Series

df_olympic_series = pd.DataFrame({'No. of Participating Countries': olympic_series_participation, 
                                  'HostCity': olympic_series_countries})
display (df_olympic_series)
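Because a dict of Series is combined by index label, the two Series do not need identical indexes. The sketch below uses deliberately mismatched years to show this; the shortened data and the names `participation`, `host_city`, and `df_aligned` are assumptions made for illustration.

```python
import pandas as pd

participation = pd.Series([205, 205, 201], index=[2012, 2008, 2004])
host_city = pd.Series(['London', 'Beijing', 'Sydney'], index=[2012, 2008, 2000])

# The rows are the union of both indexes; a year that is missing
# from one Series gets NaN in that Series' column.
df_aligned = pd.DataFrame({'No. of Participating Countries': participation,
                           'HostCity': host_city})
print(df_aligned)
```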

### ndarray
    
An ndarray can be used as an input to create a Pandas DataFrame. An ndarray is recommended wherever the dataset is number-centric and requires complex numerical computing.


In [None]:
# Create ndarrays 

olympic_array_year = np.array([2012,2008,2004,2000,1996]) # array
olympic_array_participation = np.array([205,205,201,200,197])
olympic_array_countries = np.array(['London', 'Beijing', 'Athens', 'Sydney', 'Atlanta'])

In [None]:
# Create a df with the ndarray dict

df_olympic_array = pd.DataFrame({'No. of Participating Countries': olympic_array_participation, 
                                 'HostCity': olympic_array_countries, 'Year' : olympic_array_year}) # dictionary of array
df_olympic_array

## Accessing columns in a DataFrame

In [None]:
df_olympic_data

In [None]:
df_olympic_data.HostCity # attribute access: works for a single column whose name is a valid Python identifier

In [None]:
df_olympic_data[['HostCity', "Year"]] # used for accessing multiple columns

In [None]:
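# Attribute access cannot be used for this column: its name contains spaces and a dot,
# so the line below is a SyntaxError. Use bracket notation instead.
# df_olympic_data.No. of Participating Countries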

In [None]:
df_olympic_data[['No. of Participating Countries']] # used for accessing columns with spaces in the name
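Single and double brackets return different types. A minimal sketch on a cut-down version of the Olympic data; the names `df`, `as_series`, and `as_frame` are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({'HostCity': ['London', 'Beijing'],
                   'No. of Participating Countries': [205, 205]})

# A single string key returns a Series and works for any column name.
as_series = df['No. of Participating Countries']

# A list of names returns a DataFrame, even when the list has one entry.
as_frame = df[['No. of Participating Countries']]

print(type(as_series).__name__, type(as_frame).__name__)
```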

## Data Operation with Statistical Functions

In [6]:
df_test_scores = pd.DataFrame({'Test1': [95,84,73,88,82,61], 'Test2': [74,85,82,73,77,79]}, 
                              index = ['Jack','Lewis','Patrick','Rich','Kelly','Paula'])

display (df_test_scores)

|  | Test1 | Test2 |
|---|---|---|
| Jack | 95 | 74 |
| Lewis | 84 | 85 |
| Patrick | 73 | 82 |
| Rich | 88 | 73 |
| Kelly | 82 | 77 |
| Paula | 61 | 79 |


In [None]:
df_test_scores.shape

In [None]:
df_test_scores.max() # default (axis = 0): maximum of each column

In [None]:
df_test_scores.max(axis = 1) # axis = 1: maximum of each row

In [None]:
df_test_scores.mean()

In [None]:
df_test_scores.mean(axis = 1)
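The `axis` argument decides which dimension is collapsed. The sketch below rebuilds the same test-score data so that it runs on its own; the `idxmax` call is an addition for illustration and does not appear in the original notebook.

```python
import pandas as pd

df_test_scores = pd.DataFrame({'Test1': [95, 84, 73, 88, 82, 61],
                               'Test2': [74, 85, 82, 73, 77, 79]},
                              index=['Jack', 'Lewis', 'Patrick', 'Rich', 'Kelly', 'Paula'])

# axis=0 (the default) collapses the rows: one result per column.
print(df_test_scores.max(axis=0))

# axis=1 collapses the columns: one result per row (per student).
print(df_test_scores.max(axis=1))

# idxmax returns the index label at which each column reaches its maximum.
print(df_test_scores.idxmax())
```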

## Creating a new column 

In [None]:
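# Attribute assignment does NOT create a column: pandas emits a UserWarning
# and sets a plain Python attribute, so the displayed DataFrame is unchanged.
df_test_scores.Total_Scores = df_test_scores.Test1 + df_test_scores.Test2
df_test_scores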

In [None]:
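# A list key ([['Total_Scores']]) targets a set of columns; assigning a single
# Series this way typically raises an error. Prefer a plain string key.
df_test_scores[['Total_Scores']] = df_test_scores.Test1 + df_test_scores.Test2
df_test_scores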

In [7]:
df_test_scores['Total_Scores'] = df_test_scores.Test1 + df_test_scores.Test2
df_test_scores

|  | Test1 | Test2 | Total_Scores |
|---|---|---|---|
| Jack | 95 | 74 | 169 |
| Lewis | 84 | 85 | 169 |
| Patrick | 73 | 82 | 155 |
| Rich | 88 | 73 | 161 |
| Kelly | 82 | 77 | 159 |
| Paula | 61 | 79 | 140 |


# 2

# Add a column named Average_Scores giving the average score of each student

In [8]:
df_test_scores['Average_Scores'] = (df_test_scores.Test1 + df_test_scores.Test2)/2
df_test_scores

|  | Test1 | Test2 | Total_Scores | Average_Scores |
|---|---|---|---|---|
| Jack | 95 | 74 | 169 | 84.5 |
| Lewis | 84 | 85 | 169 | 84.5 |
| Patrick | 73 | 82 | 155 | 77.5 |
| Rich | 88 | 73 | 161 | 80.5 |
| Kelly | 82 | 77 | 159 | 79.5 |
| Paula | 61 | 79 | 140 | 70.0 |


In [9]:
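# Pitfall: mean(axis = 1) averages every numeric column in the row, including
# Total_Scores and the existing Average_Scores, so the result is not the test average.
df_test_scores['Average_Scores'] = df_test_scores.mean(axis = 1)
df_test_scores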

|  | Test1 | Test2 | Total_Scores | Average_Scores |
|---|---|---|---|---|
| Jack | 95 | 74 | 169 | 105.625 |
| Lewis | 84 | 85 | 169 | 105.625 |
| Patrick | 73 | 82 | 155 | 96.875 |
| Rich | 88 | 73 | 161 | 100.625 |
| Kelly | 82 | 77 | 159 | 99.375 |
| Paula | 61 | 79 | 140 | 87.5 |


In [10]:
df_test_scores['Average_Scores'] = df_test_scores.Total_Scores/2
df_test_scores

|  | Test1 | Test2 | Total_Scores | Average_Scores |
|---|---|---|---|---|
| Jack | 95 | 74 | 169 | 84.5 |
| Lewis | 84 | 85 | 169 | 84.5 |
| Patrick | 73 | 82 | 155 | 77.5 |
| Rich | 88 | 73 | 161 | 80.5 |
| Kelly | 82 | 77 | 159 | 79.5 |
| Paula | 61 | 79 | 140 | 70.0 |
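The 105.625 values produced by `mean(axis = 1)` in cell In [9] arise because the row mean included `Total_Scores` and the existing `Average_Scores` (for Jack: (95 + 74 + 169 + 84.5) / 4). One way to keep using `mean` is to select only the test columns first. A minimal sketch on a two-student subset; the name `df_scores` is an illustrative assumption.

```python
import pandas as pd

df_scores = pd.DataFrame({'Test1': [95, 84], 'Test2': [74, 85]},
                         index=['Jack', 'Lewis'])
df_scores['Total_Scores'] = df_scores['Test1'] + df_scores['Test2']

# Select only the test columns before averaging,
# so that derived columns are not included in the mean.
df_scores['Average_Scores'] = df_scores[['Test1', 'Test2']].mean(axis=1)
print(df_scores)
```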
