
# **Pandas**

- `pandas` is a popular Python library for data analysis.

- It provides easy-to-use, highly efficient data structures.

- These data structures hold numeric or labeled data, stored in the form of tables.

**Data Structures in Pandas**

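Three fundamental data structures used in pandas are:

- Series: A 1-D labeled array.

- Data Frame: A 2-D table, which can be seen as two or more Series joined together on a shared index.

- Panel: A 3-D array. Panel was deprecated in pandas 0.20 and removed in 0.25, so it exists only in older versions.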
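As a minimal sketch of the first two structures, the block below builds a Series and a Data Frame from made-up rainfall and temperature values (the names `rainfall` and `weather` are illustrative only) and checks their dimensionality with `ndim`.

```python
import pandas as pd

# A Series is one-dimensional: one value per label.
rainfall = pd.Series([12.5, 0.0, 7.3], index=['day1', 'day2', 'day3'])

# A DataFrame is two-dimensional: rows are observations, columns are variables.
weather = pd.DataFrame({'rainfall': rainfall, 'temp': [24.1, 27.8, 25.0]},
                       index=rainfall.index)

print(rainfall.ndim)   # 1
print(weather.ndim)    # 2
print(weather.shape)   # (3, 2)
```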


# A **Series** is a 1-D array, holding data values of a single variable, captured from multiple observations.

A few examples are:

1. Height of each student belonging to a Class 'C'.
2. Amount of daily rainfall received at Station 'X', in July 2017
3. Total sales of a product 'P' in every quarter of 2016.

In [30]:
!pip install pandas
!pip install requests

In [0]:
import numpy as np
import pandas as pd   # note the trailing 's': the library is named pandas, not panda

In [32]:
## creating a pandas Series from a dictionary

d = {'Math': 180, 'Physics': 157, 'Chemistry': 157}
pd.Series(d)

Chemistry    157
Math         180
Physics      157
dtype: int64

In [33]:
# Creating a pandas Series from a NumPy array

n = 30 + 25 * np.random.randn(3) 

pd.Series(n, index = ['Math', 'Physics', 'Chemistry'])

Math         46.964490
Physics      11.838630
Chemistry     7.569658
dtype: float64

**Problem 1.1**
Create a series named heights_A with values 176.2, 158.4, 167.6, 156.2, and 161.4, which represent heights of 5 students of class A.

Label each student as s1, s2, s3, s4, and s5.

Determine the shape of heights_A and display it.

In [34]:
ht = [176.2, 158.4, 167.6, 156.2, 161.4]

heights_A = pd.Series(ht, index = ['s1', 's2', 's3', 's4', 's5'])
heights_A

heights_A.shape

(5,)

**Problem 1.2**
Create a series named weights_A with values 85.1, 90.2, 76.8, 80.4, and 78.9, which represent weights of 5 students of class A.

Label each student as s1, s2, s3, s4, and s5.

Determine data type of weights_A and display it.

Hint: Make use of Series method available in pandas library.

In [35]:
wt = [85.1, 90.2, 76.8, 80.4, 78.9]
weights_A = pd.Series(wt, index = ['s1', 's2', 's3', 's4', 's5'])

print(weights_A)

# The problem asks for the data type, which is held in the dtype attribute
print(weights_A.dtype)

s1    85.1
s2    90.2
s3    76.8
s4    80.4
s5    78.9
dtype: float64
float64


# A ***Data Frame*** is 2-D shaped and contains data of different parameters, captured from multiple observations.

Each **observation** is represented by a single **row**, and each **parameter** by a single **column**.

Each column can hold a different data type.

A few examples are:

- Height and Weight of all students, belonging to a Class 'C'.
- Daily Rainfall received and Average Temperature of a location 'X', in the year 2017.
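To illustrate that each column carries its own data type, here is a small sketch with made-up student records (the frame `students` and its column names are hypothetical); `dtypes` lists the type of every column.

```python
import pandas as pd

students = pd.DataFrame({
    'name': ['s1', 's2', 's3'],          # text
    'height': [176.2, 158.4, 167.6],     # floating point
    'passed': [True, False, True],       # boolean
})

# One dtype per column
print(students.dtypes)
```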

---



In [36]:
## Data frame created from pd.Series

s = {'subject': pd.Series(['Math', 'Physics', 'Chemistry']), 'marks': pd.Series([100, 190, 185]) }
df = pd.DataFrame(s)
df

   marks    subject
0    100       Math
1    190    Physics
2    185  Chemistry


In [37]:
## Data frame created from list

s = {'subject': ['Math', 'Physics', 'Chemistry'], 'marks': [100, 190, 185] }
df = pd.DataFrame(s)
df

   marks    subject
0    100       Math
1    190    Physics
2    185  Chemistry


**Problem 1.3**
Create a Data Frame named df_A, which holds the height and weight of five students namely s1, s2, s3, s4 and s5.

Label the columns as Student_height and Student_weight respectively.

Display index values of df_A.

Hint: Make use of DataFrame method in pandas, and also the series heights_A, weights_A created in previous problems.

In [38]:
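# Column names are shortened to 'heights' and 'weights'; later cells refer to them by these names.
sf_dict = {'heights': heights_A, 'weights': weights_A}
df_A = pd.DataFrame(sf_dict)
print(df_A)

# Display the index values, as the problem asks
print(df_A.index)

    heights  weights
s1    176.2     85.1
s2    158.4     90.2
s3    167.6     76.8
s4    156.2     80.4
s5    161.4     78.9
Index(['s1', 's2', 's3', 's4', 's5'], dtype='object')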


**Problem 1.4**
Create two Series named heights_B and weights_B from two random 1-D numpy arrays with five elements each.

The first array is obtained from the normal distribution of mean 170.0 and standard deviation 25.0.

The second array is derived from the normal distribution of mean 75.0 and standard deviation 12.0.

Label both Series elements with s1, s2, s3, s4 and s5.

In [39]:
heights_B = pd.Series(170 + 25 * np.random.randn(5), index = ['s1', 's2', 's3', 's4', 's5'])
weights_B = pd.Series(75  + 12 * np.random.randn(5), index = ['s1', 's2', 's3', 's4', 's5'])

print(heights_B)

print(weights_B)

s1    174.359272
s2    195.443941
s3    183.597668
s4    214.585255
s5    175.436158
dtype: float64
s1    68.458848
s2    88.282270
s3    67.525174
s4    79.642886
s5    74.686855
dtype: float64


**Problem 1.5**
Create a Data Frame df_B holding height and weight of students s1, s2, s3, s4 and s5 belonging to class B.

Label the columns as Student_height and Student_weight respectively.

Display the column names of df_B.

In [40]:
# Column names are kept as 'heights' and 'weights' so that df_B lines up with df_A.
df_B = pd.DataFrame({'heights': heights_B, 'weights': weights_B})
print(df_B)

# Display the column names, as the problem asks
print(df_B.columns)

       heights    weights
s1  174.359272  68.458848
s2  195.443941  88.282270
s3  183.597668  67.525174
s4  214.585255  79.642886
s5  175.436158  74.686855
Index(['heights', 'weights'], dtype='object')


# A **Panel** holds **two** or **more** Data **Frames** together as a single unit.

Panel was deprecated in pandas 0.20 and removed in 0.25, so the cells in this section run only on older versions.

A few examples are:

- Height and Weight of all students, belonging to 3 Classes 'A', 'B', and 'C'.
- Daily Rainfall received and Average Temperatures of 3 locations 'X', 'Y', and 'Z' captured in the year 2017.

---



**Problem 1.6**
Create a panel p, which holds the two previously created data frames df_A and df_B.

Label the first data frame as ClassA and the second as ClassB.


Determine the shape of panel p and display it.

In [41]:
# pd.Panel is available only in pandas versions before 0.25
df_panel = pd.Panel({'ClassA': df_A, 'ClassB': df_B})

print('Shape:-')
print(df_panel.shape)

print('\n\nClassA')
print(df_panel.ClassA)


print('\n\nClassB')
print(df_panel.ClassB)

print("\n\ndf_panel:- ")
print(df_panel)

Shape:-
(2, 5, 2)


ClassA
    heights  weights
s1    176.2     85.1
s2    158.4     90.2
s3    167.6     76.8
s4    156.2     80.4
s5    161.4     78.9


ClassB
       heights    weights
s1  174.359272  68.458848
s2  195.443941  88.282270
s3  183.597668  67.525174
s4  214.585255  79.642886
s5  175.436158  74.686855


df_panel:- 
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 5 (major_axis) x 2 (minor_axis)
Items axis: ClassA to ClassB
Major_axis axis: s1 to s5
Minor_axis axis: heights to weights
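`pd.Panel` was deprecated in pandas 0.20 and removed in 0.25, so the cell above runs only on older versions. A rough equivalent on current versions, offered as a sketch rather than a drop-in replacement, is to stack the frames with `pd.concat` and let the dictionary keys become the outer level of a MultiIndex. The frames `class_a` and `class_b` are rebuilt here with example values so the block runs on its own; the class B numbers are rounded stand-ins for the random values in the notebook.

```python
import pandas as pd

students = ['s1', 's2', 's3', 's4', 's5']
class_a = pd.DataFrame({'heights': [176.2, 158.4, 167.6, 156.2, 161.4],
                        'weights': [85.1, 90.2, 76.8, 80.4, 78.9]}, index=students)
# Example values for class B (the notebook generates these randomly).
class_b = pd.DataFrame({'heights': [174.4, 195.4, 183.6, 214.6, 175.4],
                        'weights': [68.5, 88.3, 67.5, 79.6, 74.7]}, index=students)

# The dict keys become the outer index level, the student labels the inner one.
classes = pd.concat({'ClassA': class_a, 'ClassB': class_b}, names=['class', 'student'])

print(classes.shape)            # (10, 2)
print(classes.loc['ClassA'])    # the rows of class A only
```

Selecting `classes.loc['ClassA']` plays the role that `df_panel.ClassA` played with a Panel.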


# Data Access

Data Access refers to extracting data present in defined data structures.

pandas provides indexers such as **loc** and **iloc** to get data from a Series, a DataFrame, or a Panel.
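A short sketch of both indexers on a Series, using three of the class A heights (the variable name `heights` is illustrative): `loc` takes a label, `iloc` takes an integer position.

```python
import pandas as pd

heights = pd.Series([176.2, 158.4, 167.6], index=['s1', 's2', 's3'])

print(heights.loc['s2'])    # by label
print(heights.iloc[1])      # by position (second element)
print(heights.iloc[-1])     # positions can be negative, as with lists
```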

**Problem 2.1**

Print the second element of series heights_A, as a string.

In [42]:
# By position, using iloc (plain integer keys on a labeled Series are deprecated in recent pandas)
print(str(heights_A.iloc[1]))

# By label, using square brackets
print(heights_A['s2'])


# By label, using get()
print(heights_A.get('s2'))

158.4
158.4
158.4


**Problem 2.2**

Obtain central three elements of Series heights_A.

In [43]:
print (heights_A[1:-1])

s2    158.4
s3    167.6
s4    156.2
dtype: float64


**Accessing Data from a Data Frame**

pandas provides the .loc and .iloc indexers for selecting rows.

Using square brackets ([ ]) is also allowed, especially for selecting columns.
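The common selection patterns can be sketched on a small frame with three of the class A students (the variable names below are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'heights': [176.2, 158.4, 167.6],
                   'weights': [85.1, 90.2, 76.8]},
                  index=['s1', 's2', 's3'])

col = df['heights']                 # one column -> Series
cols = df[['heights', 'weights']]   # list of columns -> DataFrame
row = df.loc['s2']                  # one row by label -> Series
first_two = df.iloc[:2]             # rows by position -> DataFrame
cell = df.loc['s3', 'weights']      # single value by row and column label

print(type(col).__name__, type(cols).__name__)
print(cell)
```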

**Problem 2.3**

Select the column of df_A that holds the student heights and store it in the variable height.

Display the type of height.

In [44]:
print(df_A)

height = df_A['heights']

print('\n\nheight:')
print(height)


print('\n\n Type of Height')
print(type(height))   ## In IPython, _ holds the most recent cell output, so type(_) works only right after a cell that displayed height

    heights  weights
s1    176.2     85.1
s2    158.4     90.2
s3    167.6     76.8
s4    156.2     80.4
s5    161.4     78.9


height:
s1    176.2
s2    158.4
s3    167.6
s4    156.2
s5    161.4
Name: heights, dtype: float64


 Type of Height
<class 'pandas.core.series.Series'>


**Problem 2.4**

Select the rows corresponding to students s1, s2 of df_A and display them.

In [45]:
df_A[:2]

    heights  weights
s1    176.2     85.1
s2    158.4     90.2


In [46]:
print(df_A.loc['s1'])

print(df_A.loc['s2'])

heights    176.2
weights     85.1
Name: s1, dtype: float64
heights    158.4
weights     90.2
Name: s2, dtype: float64


**Problem 2.5**

Select the rows corresponding to students s1, s2 and s5 of df_A in the order s2, s5, s1 and display them.

In [47]:
df_A.loc[['s2', 's5', 's1']]

    heights  weights
s2    158.4     90.2
s5    161.4     78.9
s1    176.2     85.1


**Problem 2.6**

What is the difference between loc and iloc?


loc selects rows (or columns) by their labels in the index. iloc selects rows (or columns) by their integer positions, so it only takes integers. A slice with loc includes its end label, while a slice with iloc excludes its end position. Older pandas versions also had ix, which usually behaved like loc but fell back to behaving like iloc if a label was not present in the index; ix was deprecated in pandas 0.20 and removed in 1.0.

In [65]:
df_p26 = pd.DataFrame({'A':[34, 78, 54], 'B':[12, 67, 43]}, index=['r1', 'r2', 'r3'])

print(df_p26.iloc[1:3])   #### taking integer index as input 
 
print(df_p26.loc['r2':'r3'])   #### taking given string index as input

     A   B
r2  78  67
r3  54  43
     A   B
r2  78  67
r3  54  43
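One detail worth checking in isolation is how the two indexers treat the end of a slice. In this sketch, built on the same three-row frame, the position slice `0:2` stops before position 2, while the label slice `'r1':'r2'` includes `r2`, so both return the same two rows.

```python
import pandas as pd

df = pd.DataFrame({'A': [34, 78, 54], 'B': [12, 67, 43]}, index=['r1', 'r2', 'r3'])

by_position = df.iloc[0:2]       # positions 0 and 1; position 2 is excluded
by_label = df.loc['r1':'r2']     # labels r1 through r2; r2 is included

print(by_position.equals(by_label))   # True
```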


**Problem 2.7**


How do you add a new column 'C' to the data frame from the previous problem, which has 3 rows?

In [68]:
df_p26['C'] = [12, 98, 45]
df_p26

     A   B   C
r1  34  12  12
r2  78  67  98
r3  54  43  45


**Problem 2.8**

How do you delete a column from a data frame?

In [69]:
del df_p26['B']
df_p26

     A   C
r1  34  12
r2  78  98
r3  54  45
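Assigning to `df['C']` and using `del` both change the frame in place. As an alternative sketch, `assign` and `drop` return new frames and leave the original untouched (the names `with_c` and `without_b` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [34, 78, 54], 'B': [12, 67, 43]}, index=['r1', 'r2', 'r3'])

with_c = df.assign(C=[12, 98, 45])    # new frame with an extra column
without_b = with_c.drop(columns='B')  # new frame with column B removed

print(list(df.columns))         # ['A', 'B'], the original is unchanged
print(list(without_b.columns))  # ['A', 'C']
```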


# Knowing a Series

A Series can be summarized with the ***describe*** method.

The method reports statistics such as count, mean, std, min, max, and quartiles.

In [71]:
temp = pd.Series(28 + 10*np.random.randn(10))

print(temp.describe())

count    10.000000
mean     24.824926
std      12.243934
min      -1.654203
25%      16.894413
50%      29.899773
75%      33.180425
max      37.426783
dtype: float64


# Knowing a DataFrame

Two methods, ***info*** and ***describe***, are the main ways to learn about the data in a data frame.

- By default, the **describe** method provides details of **only** the ***numeric*** columns.
- You can use the ***include*** argument to cover other columns as well.



In [79]:
df = pd.DataFrame({
    'temp':pd.Series(28 + 10*np.random.randn(10)), 
    'rain':pd.Series(100 + 50*np.random.randn(10)),
    'location':list('AAAAABBBBB')
})

df.info()   # info() prints its report itself and returns None, so it is not wrapped in print()
print()

print(df.describe(), end='\n\n')

print(df.describe(include=['object', 'float64']))  # 'object' brings in the text column location

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
location    10 non-null object
rain        10 non-null float64
temp        10 non-null float64
dtypes: float64(2), object(1)
memory usage: 320.0+ bytes

             rain       temp
count   10.000000  10.000000
mean    82.432916  24.917036
std     61.340448   8.849187
min    -10.135947  11.152692
25%     32.898724  17.700087
50%     95.646498  25.403062
75%    125.735164  32.196103
max    161.618502  37.847981

       location        rain       temp
count        10   10.000000  10.000000
unique        2         NaN        NaN
top           A         NaN        NaN
freq          5         NaN        NaN
mean        NaN   82.432916  24.917036
std         NaN   61.340448   8.849187
min         NaN  -10.135947  11.152692
25%         NaN   32.898724  17.700087
50%         NaN   95.646498  25.403062
75%         NaN  125.735164  32.196103
max         NaN  161.618502  37.847981


# Pandas I/O

- pandas.read_csv()                   --> mainly the file path in quotes and the delimiter
- pandas.read_excel()
- pandas.read_sql_table()
- pandas.read_json()
- pandas.DataFrame.to_csv()
- pandas.DataFrame.to_excel()
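As a small sketch of how the reader and writer fit together, the block below writes a frame as CSV and reads it back. It uses an in-memory `io.StringIO` buffer instead of a file on disk, and a semicolon delimiter chosen only to show the `sep` argument; both are assumptions for the example.

```python
import io
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30], 'B': [90, 100, 80]})

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, sep=';', index=False)   # index=False leaves out the row labels

buffer.seek(0)
restored = pd.read_csv(buffer, sep=';')

print(restored.equals(df))   # True
```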


In [132]:
# In a notebook, run ?pd.read_csv to see which parameters the function accepts

df_A = pd.DataFrame({'A': pd.Series([10,20,30]), 'B':pd.Series([90,100,80])})
print(df_A)

df_A.to_csv("app.csv")

    A    B
0  10   90
1  20  100
2  30   80


# How to read a CSV file from a URL

In [99]:
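import urllib.request as ur

file = ur.urlopen(ur.Request("https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv"))

# header=None treats every line as data, so the file's own header line shows up as row 0 below.
# Pass header=0 instead to replace that line with the names given here.
data = pd.read_csv(file, sep=',', header=None, decimal='.',
                   names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'target'])

data.head(10)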


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,target
0,sepal_length,sepal_width,petal_length,petal_width,species
1,5.1,3.5,1.4,0.2,setosa
2,4.9,3,1.4,0.2,setosa
3,4.7,3.2,1.3,0.2,setosa
4,4.6,3.1,1.5,0.2,setosa
5,5,3.6,1.4,0.2,setosa
6,5.4,3.9,1.7,0.4,setosa
7,4.6,3.4,1.4,0.3,setosa
8,5,3.4,1.5,0.2,setosa
9,4.4,2.9,1.4,0.2,setosa


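To read from a MySQL database you can use the commands below. No MySQL database is available here, so only the code snippet is shown. The user, password, host, port, database, and table names are placeholders, and `pymysql` is one common driver choice.

`from sqlalchemy import create_engine`

`engine = create_engine('mysql+pymysql://user:password@localhost:3306/database_name')`

`df = pd.read_sql_table('table_name', engine)`
`df`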
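For a version that runs without a database server, this sketch swaps MySQL for an in-memory SQLite database from the standard library. It uses `read_sql_query` rather than `read_sql_table`, because the latter needs SQLAlchemy; the table name `marks` and its contents are made up for the example.

```python
import sqlite3
import pandas as pd

# An in-memory SQLite database stands in for the MySQL server.
conn = sqlite3.connect(':memory:')

marks = pd.DataFrame({'subject': ['Math', 'Physics', 'Chemistry'],
                      'marks': [100, 190, 185]})
marks.to_sql('marks', conn, index=False)     # create and fill a table

# read_sql_query accepts a plain DBAPI connection for SQLite.
df_sql = pd.read_sql_query('SELECT subject, marks FROM marks WHERE marks > 150', conn)
conn.close()

print(df_sql)
```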



# **How to deal with large datasets in smaller chunks**

- Set **iterator** to True in a pandas read method such as read_csv.
- Use **get_chunk()** on the returned reader to receive the data piece by piece.
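A related option, sketched here, is the `chunksize` argument of `read_csv`, which returns an iterator that yields fixed-size pieces in a loop. Ten rows of made-up CSV text in an `io.StringIO` buffer stand in for a large file.

```python
import io
import pandas as pd

# Ten rows of made-up CSV text stand in for a large file.
csv_text = 'x,y\n' + '\n'.join(f'{i},{i * i}' for i in range(10))

total = 0
sizes = []
# chunksize=4 yields DataFrames of at most four rows each.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    sizes.append(len(chunk))
    total += chunk['y'].sum()

print(sizes)   # [4, 4, 2]
print(total)   # 285
```

Only one chunk is held in memory at a time, so running totals like this work on files that do not fit in memory as a whole.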

In [111]:
import urllib.request as ur

file = ur.urlopen(ur.Request("https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv"))

data_iterator = pd.read_csv(file, sep = ',' , header  = None, decimal= '.', names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'target'], 
                            iterator = True)

chunk = [data_iterator.get_chunk(100), data_iterator.get_chunk(5), data_iterator.get_chunk(2)]

chunk[2] 

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,target
105,6.5,3,5.8,2.2,virginica
106,7.6,3,6.6,2.1,virginica


**Reading Data from Json**

pandas provides the utilities read_json and to_json to deal with JSON strings or files.
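A minimal round trip between the two, using a two-row frame with made-up names and ages: `to_json` produces a JSON string, and `read_json` parses it back. The string is wrapped in `io.StringIO` because recent pandas versions warn when a literal JSON string is passed directly.

```python
import io
import pandas as pd

people = pd.DataFrame({'name': ['Hodges', 'Berta'], 'age': [32, 41]})

# orient='records' writes a list of objects, one per row.
json_text = people.to_json(orient='records')
print(json_text)

# Wrapping the string in StringIO gives read_json a file-like object.
restored = pd.read_json(io.StringIO(json_text), orient='records')
print(restored)
```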

In [130]:
import urllib.request as ur

file = ur.urlopen(ur.Request("https://raw.githubusercontent.com/whatsupabhijit/py_rambling/master/pandas/contact.json"))

##
## If you don't want to load from a URL, you can define the data in your code instead:
## serialise it with json.dumps and pass the result to read_json.
## Recent pandas versions expect the string to be wrapped in io.StringIO.
#
# import io, json
# person = [your json goes here]
# person_json_str = json.dumps(person)
# df_json = pd.read_json(io.StringIO(person_json_str))
#

df_json = pd.read_json(file)

df_json.head(1)

Unnamed: 0,_id,about,address,age,balance,company,email,eyeColor,favoriteFruit,friends,...,guid,index,isActive,latitude,longitude,name,phone,picture,registered,tags
0,5bf8e6475fc62129cdf61a9a,Voluptate fugiat ut nisi aute adipisicing volu...,"526 Oliver Street, Edgar, Delaware, 6455",32,"$3,010.22",ENORMO,hodgesrivera@enormo.com,green,strawberry,"[{'id': 0, 'name': 'Berta Hays'}, {'id': 1, 'n...",...,90ae433a-6ad4-438d-a6c7-d8668d9df7ef,0,True,38.115627,-21.175877,Hodges Rivera,+1 (933) 471-2714,http://placehold.it/32x32,2018-01-30T09:25:45 -08:00,"[esse, minim, reprehenderit, ullamco, consequa..."


In [0]:
#?pd.DataFrame.to_csv
?pd.read_csv

# Indexing

So far we have seen only very basic indexing of rows and columns, but pandas provides more. Let's see what we can index and how.

# Single Level Indexing

So far, whenever a data frame object was created, we passed the index as a ***parameter***.
We can still do that.

But a DataFrame also has index as an attribute. Once the data frame is created, you can set the index ***attribute*** later.

In [143]:
row = 5
col = 3

df = pd.DataFrame(np.random.rand(row,col))

df.index = [ 'row#' + str(i) for i in range(1, row+1) ]

df

              0         1         2
row#1  0.128874  0.494885  0.804330
row#2  0.928458  0.531548  0.421639
row#3  0.581121  0.120799  0.347124
row#4  0.052890  0.279454  0.839811
row#5  0.414797  0.700939  0.726013
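Besides assigning to the `index` attribute, an existing column can be promoted to the index with `set_index`. This is a sketch with a made-up frame of cities and rainfall values:

```python
import pandas as pd

df = pd.DataFrame({'city': ['X', 'Y', 'Z'], 'rain': [120.5, 80.0, 95.2]})

# set_index returns a new frame whose row labels come from the 'city' column.
by_city = df.set_index('city')
print(by_city.loc['Y', 'rain'])            # 80.0

# reset_index moves the labels back into an ordinary column.
print(by_city.reset_index().equals(df))    # True
```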


# Date Level Indexing

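Dates can also serve as index values. Two functions help here:

- **date_range** creates a range of dates from a start day, a number of periods, and a frequency.
- **to_datetime** converts strings or numbers into datetime values.

More in the [pandas time series documentation](http://pandas.pydata.org/pandas-docs/stable/timeseries.html)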

In [156]:

# On pandas 2.2 and later, the aliases 'ME', 'YE' and 'BQE' are preferred over 'M', 'Y' and 'BQ'.
print("These are the 10 days" , pd.date_range('1/1/2018', periods=10, freq='D'))
print("These are the 5 months" , pd.date_range('1/1/2018', periods=5, freq='M'))
print("These are the 3 years" , pd.date_range('1/1/2018', periods=3, freq='Y'))
print("These are the 3 weeks" , pd.date_range('1/1/2018', periods=3, freq='W'))
print("These are the 4 Business Quarters" , pd.date_range('1/1/2018', periods=4, freq='BQ'))

These are the 10 days DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08',
               '2018-01-09', '2018-01-10'],
              dtype='datetime64[ns]', freq='D')
These are the 5 months DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
               '2018-05-31'],
              dtype='datetime64[ns]', freq='M')
These are the 3 years DatetimeIndex(['2018-12-31', '2019-12-31', '2020-12-31'], dtype='datetime64[ns]', freq='A-DEC')
These are the 3 weeks DatetimeIndex(['2018-01-07', '2018-01-14', '2018-01-21'], dtype='datetime64[ns]', freq='W-SUN')
These are the 4 Business Quarters DatetimeIndex(['2018-03-30', '2018-06-29', '2018-09-28', '2018-12-31'], dtype='datetime64[ns]', freq='BQ-DEC')


In [165]:
#Date values with formatting

print(pd.to_datetime(['20180310', '20181214', 'stringnotindateformat'], format='%Y%m%d', errors='coerce'), end = '\n\n') # NaT - Not a Time

print(pd.to_datetime(pd.Series(['Jul 31, 2009', '2010-01-10', None])))

DatetimeIndex(['2018-03-10', '2018-12-14', 'NaT'], dtype='datetime64[ns]', freq=None)

0   2009-07-31
1   2010-01-10
2          NaT
dtype: datetime64[ns]


**How does this relate to a DataFrame? to_datetime can also assemble dates from year, month, and day columns.**


In [167]:
df_index = pd.DataFrame({'year': [2014, 2015], 'month': [7,2], 'day': [28,16]})

pd.to_datetime(df_index)

0   2014-07-28
1   2015-02-16
dtype: datetime64[ns]
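Once dates are used as an index, rows can be selected with date strings. The sketch below puts a `date_range` on a Series of made-up rainfall values and selects a single day and a span of days.

```python
import pandas as pd

days = pd.date_range('2018-01-01', periods=5, freq='D')
rain = pd.Series([0.0, 4.2, 1.1, 0.0, 9.8], index=days)

print(rain.loc['2018-01-03'])                  # a single day
print(rain.loc['2018-01-02':'2018-01-04'])     # a date slice; both ends are included
```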

In [0]:
?pd.date_range

# Multi Level Indexing / Hierarchical indexing

We have seen single level indexing. Now we will see hierarchical indexing.

For this we need to create a MultiIndex object.




In [198]:
lists = [['Abhijit', 'Abhijit',   'Abhijit', 'Joyee', 'Joyee'],
         ['won',      'lost',   'won',     'lost',    'won']]


m_index = pd.MultiIndex.from_arrays(lists, names = ['name' , 'game'])

m_index

#Now you can use this multi level index in your Series/DataFrame

m_series = pd.Series(np.random.randn(5)*100 % 100, index=m_index)
m_series

name     game
Abhijit  won     50.890795
         lost    96.437181
         won     83.203574
Joyee    lost    57.492196
         won     25.859586
dtype: float64

In [199]:
m_series['Abhijit']

game
won     50.890795
lost    96.437181
won     83.203574
dtype: float64

In [200]:
m_series['Abhijit','won']

(Abhijit, won)    50.890795
(Abhijit, won)    83.203574
dtype: float64
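Plain square brackets select on the outer level first. To select on the inner level alone, one option is `xs` (cross-section) with the level name. The sketch uses fixed scores instead of random ones so the result is predictable.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [['Abhijit', 'Abhijit', 'Joyee', 'Joyee'],
     ['won', 'lost', 'won', 'lost']],
    names=['name', 'game'])
scores = pd.Series([50, 96, 25, 57], index=index)

# Outer level: plain indexing works.
print(scores['Joyee'])

# Inner level: xs (cross-section) selects by a named level.
won = scores.xs('won', level='game')
print(won)
```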

**Indexing Summary**

1. Create an index named dates, representing a range of dates from 1-Sep-2017 to 15-Sep-2017.

2. Convert the following list of date strings into datetime objects. Capture the result in search_dates.

   `datelist = ['14-Sep-2017', '9-Sep-2017']`

3. Filter the dates in dates that match the dates in search_dates. Display the filtered dates.

   Hint: use the isin method associated with DatetimeIndex objects.

4. Create a multi index named mi_index with two levels, built from the array arraylist below. Display the levels of mi_index.

   `arraylist = [['classA'] * 5 + ['classB'] * 5, ['s1', 's2', 's3','s4', 's5'] * 2]`

In [221]:
#1
dates = pd.date_range('9/1/2017', periods=15, freq = 'D')
print(dates, end='\n\n')

#2
datelist = ['14-Sep-2017', '9-Sep-2017']
search_dates = pd.to_datetime(datelist)
print(search_dates, end= '\n\n')


#3
filtered_dates_bool = dates.isin(search_dates)
#print (filtered_dates_bool, end='\n\n')

filtered_dates = dates[filtered_dates_bool]
print (filtered_dates, end='\n\n')


#4
arraylist = [['classA'] * 5 + ['classB'] * 5, ['s1', 's2', 's3','s4', 's5'] * 2]
mi_index = pd.MultiIndex.from_arrays(arraylist)
print(mi_index.levels)

DatetimeIndex(['2017-09-01', '2017-09-02', '2017-09-03', '2017-09-04',
               '2017-09-05', '2017-09-06', '2017-09-07', '2017-09-08',
               '2017-09-09', '2017-09-10', '2017-09-11', '2017-09-12',
               '2017-09-13', '2017-09-14', '2017-09-15'],
              dtype='datetime64[ns]', freq='D')

DatetimeIndex(['2017-09-14', '2017-09-09'], dtype='datetime64[ns]', freq=None)

DatetimeIndex(['2017-09-09', '2017-09-14'], dtype='datetime64[ns]', freq=None)

[['classA', 'classB'], ['s1', 's2', 's3', 's4', 's5']]
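Since `arraylist` pairs every class with every student, the same index can also be built with `MultiIndex.from_product`, which takes the unique values of each level. This is an alternative sketch, not part of the exercise; the names `mi_product` and `mi_arrays` are illustrative.

```python
import pandas as pd

# Every combination of class and student, in order.
mi_product = pd.MultiIndex.from_product(
    [['classA', 'classB'], ['s1', 's2', 's3', 's4', 's5']])

# The same index built from two full-length arrays, as in the exercise.
arraylist = [['classA'] * 5 + ['classB'] * 5, ['s1', 's2', 's3', 's4', 's5'] * 2]
mi_arrays = pd.MultiIndex.from_arrays(arraylist)

print(mi_product.equals(mi_arrays))   # True
print(len(mi_product))                # 10
```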
