# PANDAS

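At the core of pandas are three data structures:
- Series
- DataFrame
- Panel (deprecated in pandas 0.20 and removed in 0.25; a DataFrame with a MultiIndex covers the same use case)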

![](../data-csv/media-pics-videos/pandas-data-structure.png)

![](../data-csv/media-pics-videos/pandas-data-structure1.png)

# Series

A Series models one-dimensional data, similar to a list in Python. A Series object also carries a few more pieces of information, including an index and a name. A recurring idea in pandas is the notion of an axis. Because a Series is one-dimensional, it has a single axis: the index.

In [2]:
#First step is to import pandas
import pandas as pd

In [3]:
#Create a series
cisco_weekly_share_price = pd.Series([39, 45, 48, 49, 42], name="share-price")
cisco_weekly_share_price
# The left-most column is the index (the axis); 0, 1, 2, ... are the index labels (axis labels)
# A DataFrame has two axes: one for rows and one for columns
# The right column holds the data/values
# The last line prints the name of the series and its data type

0    39
1    45
2    48
3    49
4    42
Name: share-price, dtype: int64

In [None]:
# Get the data-type of the series
cisco_weekly_share_price.dtypes

In [None]:
# Get the dimension of the series
cisco_weekly_share_price.shape

In [None]:
# Get the first five elements of the series (head is a method, so it must be called)
cisco_weekly_share_price.head()

In [4]:
# Get the last five elements of the series (tail is a method, so it must be called)
cisco_weekly_share_price.tail()

0    39
1    45
2    48
3    49
4    42
Name: share-price, dtype: int64
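
Both `head` and `tail` are methods, so they only return data when called with parentheses, and both accept an optional row count `n` (default 5). A minimal sketch, reusing the sample prices from above under the illustrative name `prices`:

```python
import pandas as pd

# Same sample prices as above; the variable name `prices` is only for this sketch.
prices = pd.Series([39, 45, 48, 49, 42], name="share-price")

first_two = prices.head(2)   # first two entries
last_three = prices.tail(3)  # last three entries

print(first_two.tolist())    # [39, 45]
print(last_three.tolist())   # [48, 49, 42]
```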

### Index of the series
* The default index is a sequence of integers starting from 0 and increasing by 1
* We can override the default index by specifying the "index" parameter
* The "index" need not hold integers; its labels can be of any hashable type (including strings)

In [5]:
# Inspect the index of the series
# The default values for an index are linearly incrementing integers
cisco_weekly_share_price.index

RangeIndex(start=0, stop=5, step=1)

In [8]:
# Get the last label of the index
cisco_weekly_share_price.index[-1]

4

In [9]:
# Index labels can also be strings
cisco_weekly_share_price = pd.Series([39, 45, 48, 49, 42], name="share-price", 
                                     index=['week3','week1','week4','week2','week5'])
cisco_weekly_share_price.index

Index(['week3', 'week1', 'week4', 'week2', 'week5'], dtype='object')

In [10]:
# Look up a value by its string label
cisco_weekly_share_price['week4']

48
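
With string labels there are two ways to address an entry: by label or by position. The sketch below, which rebuilds the same series under the illustrative name `prices`, shows `.loc` for labels, `.iloc` for positions, and `sort_index` for putting the weeks in order.

```python
import pandas as pd

# Same data and labels as the series above; `prices` is an illustrative name.
prices = pd.Series(
    [39, 45, 48, 49, 42],
    name="share-price",
    index=["week3", "week1", "week4", "week2", "week5"],
)

by_label = prices.loc["week4"]     # label-based lookup
by_position = prices.iloc[2]       # position-based lookup (third entry)
label_slice = prices.loc["week1":"week2"]  # label slices include both end labels
sorted_prices = prices.sort_index()        # new Series ordered week1..week5

print(by_label, by_position)       # 48 48
print(label_slice.tolist())        # [45, 48, 49]
print(sorted_prices.tolist())      # [45, 49, 39, 48, 42]
```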

### Data/values in the series
* The values need not be homogeneous; they can be a mixture of data types. In that case the dtype is object, the generic dtype pandas uses for arbitrary Python objects (including strings)

In [None]:
cricketer_details = pd.Series(['Sachin','Tendulkar',46,100],name="details",
                             index=['First Name','Last Name','Age','No of centuries'])
cricketer_details
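
To see what a mixed series looks like in practice, the following sketch (using the illustrative name `details` for the same data) checks the dtype and the Python type of each stored element.

```python
import pandas as pd

# Same data as cricketer_details above; `details` is an illustrative name.
details = pd.Series(
    ["Sachin", "Tendulkar", 46, 100],
    name="details",
    index=["First Name", "Last Name", "Age", "No of centuries"],
)

# The series as a whole is typed as object ...
print(details.dtype)   # object

# ... but each element keeps its own Python type.
type_names = [type(value).__name__ for value in details]
print(type_names)      # ['str', 'str', 'int', 'int']
```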

### CRUD of series

#### Creating series using lists and dictionaries
- Lists can be used to init series
- Dictionaries can be used to init series

In [None]:
cricketer_details = pd.Series(['Sachin','Tendulkar',46,100],name="details",
                             index=['First Name','Last Name','Age','No of centuries'])
cricketer_details

In [None]:
cricketer_details = pd.Series({'First Name':'Sachin','Last Name': 'Tendulkar', 'Age':46, 'No of centuries':100},name="details",
                             index=['First Name','Last Name','Age','No of centuries'])
cricketer_details
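
When a dictionary is used, its keys become the index, so the `index` parameter is optional. If it is passed anyway, pandas uses it to select and order the entries. A small sketch, assuming a shortened dictionary named `record` that deliberately omits one key:

```python
import pandas as pd

# Illustrative dictionary; "No of centuries" is left out on purpose.
record = {"First Name": "Sachin", "Last Name": "Tendulkar", "Age": 46}

# Without `index`, the dictionary keys become the index in insertion order.
from_dict = pd.Series(record, name="details")

# With `index`, entries are selected and ordered by those labels;
# a label that is missing from the dictionary gets NaN as its value.
reindexed = pd.Series(
    record, index=["Age", "First Name", "No of centuries"], name="details"
)

print(list(from_dict.index))
print(reindexed)
```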

#### Reading from Series
- Use index directly
- Use iterator and iterate through Series in a for loop
- Use Series.index and Series.values to iterate through index and data respectively
- Use Series.index and Series.values in logical expressions

In [None]:
# Use index directly
cricketer_details['First Name']

In [None]:
# Iterate over all the entries in the series
for item in cricketer_details.values:
    print(item)

In [None]:
# Iterate over all the indices in the series
for item in cricketer_details.index:
    print(item)

In [None]:
# Logical expressions using Series
'Sachin' in cricketer_details.values

In [None]:
# Logical expressions using Series
'First Name' in cricketer_details.index

In [None]:
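# Iterate over (index label, value) pairs
# Series.iteritems() was removed in pandas 2.0; items() is the replacement
for item in cricketer_details.items():
    print(item)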
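
One detail worth spelling out: the `in` operator applied to the Series itself tests the index labels, not the values, which is why the cells above use `.values` and `.index` explicitly. A minimal sketch with the same data under the illustrative name `details`:

```python
import pandas as pd

details = pd.Series(
    ["Sachin", "Tendulkar", 46, 100],
    name="details",
    index=["First Name", "Last Name", "Age", "No of centuries"],
)

label_hit = "Age" in details            # True: "Age" is an index label
value_miss = "Sachin" in details        # False: membership checks labels only
value_hit = "Sachin" in details.values  # True: search the values explicitly

# items() yields (label, value) tuples, much like dict.items()
pairs = list(details.items())
print(pairs[0])  # ('First Name', 'Sachin')
```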

#### Updating Series data
- If the index label exists, the value for that label is updated in place
- If the index label does not exist, the value is appended to the series under the new label
- If the index label occurs more than once, all matching values are updated
- To update an item based on its position, use .iloc
- Older pandas releases had a .set_value method that added or updated an item and returned the series; it was deprecated and later removed, so use .at or .loc instead

In [None]:
# To update a value for a given index label, 
# the standard index assignment operation works and performs the update in-place
cricketer_details['No of centuries'] = 101
cricketer_details

In [None]:
# The index assignment operation also works to add a new index and a value.
cricketer_details['No of ODI'] = 350
cricketer_details

In [None]:
#  To update values based purely on position, perform an index assignment of the .iloc attribute
cricketer_details.iloc[0] = 'Sachin T'
cricketer_details
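
The duplicate-label case and the label-based setters are easier to see on a tiny series. The data below is hypothetical and only serves to illustrate the rules listed above.

```python
import pandas as pd

# Hypothetical series whose index contains the label "a" twice.
dup = pd.Series([10, 20, 30], index=["a", "b", "a"])
dup["a"] = 99                 # every entry labelled "a" is updated
print(dup.tolist())           # [99, 20, 99]

# Hypothetical series with unique labels.
unique = pd.Series([10, 20], index=["a", "b"])
unique.at["b"] = 21           # .at sets a single value by label
unique.loc["c"] = 5           # .loc with a new label appends an entry
print(unique.tolist())        # [10, 21, 5]
print(list(unique.index))     # ['a', 'b', 'c']
```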

#### Deletion
Deletion is not common in the pandas world. It is more common to use filters or masks to create a new series that has only the items that you want. However, if you really want to remove entries, you can delete based on index entries.

In [None]:
# Delete an item using index
del cricketer_details['No of ODI']
cricketer_details
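
The filter-or-mask approach mentioned above leaves the original series untouched and returns a new one. A short sketch on hypothetical weekly prices, with `drop` shown as the non-mutating counterpart of `del`:

```python
import pandas as pd

# Hypothetical weekly prices, used only for illustration.
prices = pd.Series(
    [39, 45, 48, 49, 42],
    index=["week1", "week2", "week3", "week4", "week5"],
    name="share-price",
)

mask = prices > 44                     # boolean Series, one flag per entry
above_44 = prices[mask]                # new Series with only the matching entries
without_week1 = prices.drop("week1")   # new Series without the given label

print(above_44.tolist())               # [45, 48, 49]
print(list(without_week1.index))       # ['week2', 'week3', 'week4', 'week5']
print(len(prices))                     # 5, the original is unchanged
```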

## DataFrame Indexing

In [None]:
#Read the CSV file from the district-wise-education data for India
df = pd.read_csv("../data-csv/csv-files/2015_16_Districtwise.csv")

In [None]:
#Read the first few rows of the dataframe
df.head()

In [None]:
#Read the last few rows of the dataframe
df.tail()

In [None]:
#The shape attribute returns a (number of rows, number of columns) tuple
df.shape

In [None]:
#Print the data-types of the different columns
df.dtypes

In [None]:
#Print the data-type of a particular column
print(df.dtypes['STATNAME'])

In [None]:
#List the columns in the table
list(df.columns)

In [None]:
df.info()

In [None]:
#Extract only a particular column from the dataframe using column name
df_districts = df['DISTNAME']
print(df_districts)

In [None]:
#Extract first column from the dataframe using iloc
#The iloc indexer syntax is data.iloc[<row selection>, <column selection>]
df_first_col = df.iloc[:,0] # first column of data frame
print(df_first_col)

In [None]:
#Extract last column from the dataframe using iloc
df_last_col = df.iloc[:,-1] # last column of data frame
print(df_last_col)

In [None]:
#Extract second row from the dataframe (iloc positions start at 0)
df_row2 = df.iloc[1]
print(df_row2)

In [None]:
#Extract first five rows of the dataframe (positions 0 to 4)
df_five_rows = df.iloc[:5]
print(df_five_rows)

In [None]:
#Extract first five rows and first five columns of the dataframe
df_five_rows_columns = df.iloc[:5, :5]
print(df_five_rows_columns)

In [None]:
#Extract first five rows and selected columns (positions 1, 4 and 5) of the dataframe
df_five_rows_columns = df.iloc[:5, [1,4,5]]
print(df_five_rows_columns)
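
Because `iloc` slices exclude their end position, `:5` selects positions 0 through 4 while `1:5` selects only positions 1 through 4. The sketch below uses a small hypothetical frame (only the column names are taken from the dataset above; the rows are invented) and also shows the label-based equivalent with `loc`.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the dataset above.
toy = pd.DataFrame(
    {
        "STATNAME": ["A", "A", "B", "B", "C", "C"],
        "DISTNAME": ["d0", "d1", "d2", "d3", "d4", "d5"],
        "VILLAGES": [100, 1600, 900, 2000, 1500, 50],
    }
)

first_five = toy.iloc[:5]      # positions 0-4: five rows
rows_1_to_4 = toy.iloc[1:5]    # positions 1-4: four rows, the first row is skipped
block = toy.iloc[:5, [0, 2]]   # first five rows, columns at positions 0 and 2

# loc selects by label, and label slices include the end label (4 here).
by_name = toy.loc[:4, ["STATNAME", "VILLAGES"]]

print(first_five.shape, rows_1_to_4.shape)  # (5, 3) (4, 3)
print(by_name.equals(block))                # True
```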

In [None]:
#Convert a slice of the dataframe into a two-dimensional NumPy array
df.iloc[1:5,1:].values

In [None]:
#Count the number of rows in which each value of a column occurs
df['STATNAME'].value_counts()

In [None]:
#Select rows which have more than 1500 villages
df[df['VILLAGES'] > 1500]
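
The expression inside the brackets is a boolean Series, so conditions can be combined with `&` and `|` (each condition in its own parentheses) and paired with a column selection through `loc`. A sketch on hypothetical rows, again borrowing only the column names from the dataset:

```python
import pandas as pd

# Hypothetical rows; only the column names come from the dataset above.
toy = pd.DataFrame(
    {
        "STATNAME": ["A", "A", "B", "B", "C", "C"],
        "DISTNAME": ["d0", "d1", "d2", "d3", "d4", "d5"],
        "VILLAGES": [100, 1600, 900, 2000, 1500, 50],
    }
)

# Single condition: strictly more than 1500 villages.
many_villages = toy[toy["VILLAGES"] > 1500]

# Two conditions combined with &; the parentheses are required.
many_in_b = toy[(toy["VILLAGES"] > 1500) & (toy["STATNAME"] == "B")]

# Filter rows and pick columns in one step with loc.
names_only = toy.loc[toy["VILLAGES"] > 1500, ["DISTNAME", "VILLAGES"]]

print(many_villages["DISTNAME"].tolist())  # ['d1', 'd3']
print(many_in_b["DISTNAME"].tolist())      # ['d3']
```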

In [None]:
#Install lxml to parse html tables
#!type python
#!type -a pip
#!pip3 install lxml

In [None]:
#Read cisco share price history table from wikipedia URL

#read_html needs lxml to be installed; importing it here only confirms that it is available
import lxml

#URL for wikipedia link
url_cisco = 'https://en.wikipedia.org/wiki/Cisco_Systems'

#pandas.read_html reads every table at the given URL
#whose text matches the string passed as match
#It always returns a list of data-frames, one per matching table
df_cisco = pd.read_html(url_cisco, match='Employees', header=0)

In [None]:
#Since read_html returns list of data-frames
#Get the first table into a data-frame
df_cisco_history = df_cisco[0]

In [None]:
#print first five rows
df_cisco_history.head()

In [None]:
#print yearwise cisco share-price
df_cisco_history.iloc[:,[0,4]]

In [None]:
#URL for the Yahoo Finance price history of CSCO
url_cisco = 'https://finance.yahoo.com/quote/CSCO/history?period1=1425081600&period2=1582848000&interval=1d&filter=history&frequency=1d'
df_cisco_daily = pd.read_html(url_cisco, match='Volume', header=0)
df_cisco_daily[0].head()

## Plots using dataframes