# Introduction to Pandas

Pandas is an open-source Python library for working with relational (tabular) and labeled data easily and intuitively. It provides data structures and operations for manipulating numerical tables and time series. The library is built on top of NumPy, so most column-wise operations run as vectorized array code rather than Python loops.

Pandas provides two main data structures for manipulating data:

- Series
- DataFrame


### Pandas Series
A pandas Series is a one-dimensional labeled array capable of holding data of any type (integers, strings, floats, Python objects, etc.). The axis labels are collectively called the index. A Series is comparable to a single column in an Excel sheet. Labels need not be unique but must be of a hashable type. The object supports both integer-position and label-based indexing and provides many methods for operations involving the index.

![download.png](attachment:download.png)
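
As a small illustration of the points above, the sketch below builds a Series whose labels are not unique and reads it both by label and by position. The variable name `temps` and its values are made up for this example.

```python
import pandas as pd

# hypothetical data: two readings share the label 'mon'
temps = pd.Series([21.5, 19.0, 23.1], index=['mon', 'tue', 'mon'])

# label-based access: a repeated label returns every matching element
print(temps.loc['mon'])

# position-based access: always a single element
print(temps.iloc[1])

# reports whether every label occurs only once
print(temps.index.is_unique)
```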

### Pandas DataFrame
A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns); that is, the data is aligned in a tabular fashion in rows and columns. A DataFrame consists of three principal components: the data, the row labels (index), and the column labels.

![download%20%281%29.png](attachment:download%20%281%29.png)
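
The three components can each be inspected directly. The sketch below uses a made-up two-row frame named `frame`; the attribute and method names are standard pandas, the data is only for illustration.

```python
import pandas as pd

# hypothetical two-row frame used only for illustration
frame = pd.DataFrame({'Price': [20, 56], 'Sales': [500, 600]}, index=['r1', 'r2'])

print(frame.index)       # row labels
print(frame.columns)     # column labels
print(frame.to_numpy())  # the underlying data as a NumPy array
```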

#### Series

- Creating a Series
- Accessing elements of a Series
- Indexing and selecting data in a Series
- Binary operations on a Series
- Conversion operations on a Series

In [None]:
# load the package
import pandas as pd

In [None]:
# create pandas series
s = pd.Series([1, 2, 3, 4])

In [None]:
print(type(s))
print(s)

### Indexing & Slicing

In [None]:
# access a single element by its label (the default labels are 0, 1, 2, ...)
s[2]

In [None]:
# series values
s.values

In [None]:
# series index
s.index

In [None]:
print(s.keys())

In [None]:
# create series
s = pd.Series([6, 7, 8, 9, 10], index=['a', 'b', 'c', 'd','e'])
s

In [None]:
s['a']

In [None]:
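# access a single element by label
s['b']

In [None]:
# access a single element by position; plain s[0] on a string-labeled
# index is deprecated in recent pandas versions, so iloc is used here
s.iloc[0]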

In [None]:
s.index

In [None]:
# adding element
s['f'] = 11
s

In [None]:
# updating element
s['a'] = 100
print(s.keys())

#### Methods associated with the Series object
- append (removed in pandas 2.0; pd.concat is the replacement)
- drop
- astype
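
astype is listed above but not exercised in the cells that follow. A minimal sketch of converting between dtypes, using a made-up Series named `prices`:

```python
import pandas as pd

# hypothetical Series of numbers stored as strings
prices = pd.Series(['5', '10', '13'])
print(prices.dtype)

# astype returns a new Series; the original is left unchanged
prices_int = prices.astype(int)
print(prices_int.dtype)

# convert again, this time naming the target dtype as a string
prices_float = prices_int.astype('float64')
print(prices_float)
```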


In [None]:
# creating pandas series
ser_1 = pd.Series([12,13,14], index=['g','h','i'])

In [None]:
# append one series to another; Series.append was removed in pandas 2.0,
# so pd.concat is used here instead
s = pd.concat([s, ser_1])
s

In [None]:
# combine by using concat method
s1 = pd.concat([s, ser_1])
s1

In [None]:
s

In [None]:
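# drop returns a new Series without the label 'g'; s itself is unchanged
s.drop('g')

In [None]:
s

In [None]:
# inplace=True removes the label from s itself and returns None
s.drop('g', inplace=True)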

In [None]:
s

In [33]:
# inplace=True modifies the original Series object;
# the equivalent without inplace is: s = s.drop(['h', 'i'])
s.drop(['h','i'],inplace=True)
s

a    100
b      7
c      8
d      9
e     10
f     11
dtype: int64

#### create a series from a dictionary

In [34]:
# dict to series
d1 = {1:'a', 2:'b', 3:'c', 4:'d', 5:'e'}
s1 = pd.Series(d1)
print(s1)

1    a
2    b
3    c
4    d
5    e
dtype: object


In [35]:
d2 = dict(s1)
d2

{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e'}

In [36]:
list(s1)

['a', 'b', 'c', 'd', 'e']
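
The built-in dict() and list() calls above work because a Series exposes its labels through keys() and its values through iteration. pandas also offers dedicated conversion methods; a short sketch, using a made-up Series named `letters`:

```python
import pandas as pd

letters = pd.Series({1: 'a', 2: 'b', 3: 'c'})

as_dict = letters.to_dict()    # labels become keys
as_list = letters.tolist()     # values only, labels are dropped
as_array = letters.to_numpy()  # values as a NumPy array

print(as_dict)
print(as_list)
print(type(as_array))
```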

In [37]:
# int elements
l1 = [1,2,3,4,5]
s1 = pd.Series(l1)
print(s1)

0    1
1    2
2    3
3    4
4    5
dtype: int64


In [38]:
# float elements
t1 = (1.0, 2.0, 3.0, 4.0, 5.0)
s1 = pd.Series(t1)
print(s1)

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: float64


#### Operations on Series objects

- `+`
- `*`
- `**`

In [39]:
# add a scalar value to every element of the series
s1+1

0    2.0
1    3.0
2    4.0
3    5.0
4    6.0
dtype: float64

In [40]:
# multiply every element of the series by a scalar
s1*2

0     2.0
1     4.0
2     6.0
3     8.0
4    10.0
dtype: float64

In [41]:
# exponentiation operator: raise every element to the power 2
s1**2

0     1.0
1     4.0
2     9.0
3    16.0
4    25.0
dtype: float64

In [42]:
# exponentiation operator: raise every element to the power of itself
s1**s1

0       1.0
1       4.0
2      27.0
3     256.0
4    3125.0
dtype: float64
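
The operations above combine a Series with a scalar or with itself. When two different Series are combined, pandas matches elements by label rather than by position. A sketch with two made-up Series, `left` and `right`:

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# labels present in only one operand produce NaN
total = left + right
print(total)

# the add method accepts fill_value to treat a missing label as 0
total_filled = left.add(right, fill_value=0)
print(total_filled)
```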

### DataFrame

- Creating a DataFrame
- Dealing with Rows and Columns
- Indexing and Selecting Data
- Working with Missing Data
- Iterating over rows and columns


In [43]:
# creating pandas dataframe
d1 = pd.DataFrame()
print(type(d1))
print(d1)

<class 'pandas.core.frame.DataFrame'>
Empty DataFrame
Columns: []
Index: []


In [51]:
# create a dataframe from a dictionary of lists
d = {'Price':[5, 10, 13, 26, 43], 'Sales':[10,30,50,45,42]}

In [52]:
df=pd.DataFrame(d)

In [53]:
df

| | Price | Sales |
|---|---|---|
| 0 | 5 | 10 |
| 1 | 10 | 30 |
| 2 | 13 | 50 |
| 3 | 26 | 45 |
| 4 | 43 | 42 |


In [55]:
l1 = [[1,2,3], [4,5,6], [7,8,9]]
df = pd.DataFrame(l1, columns = ('A', 'B', 'C'))
df

| | A | B | C |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 |
| 2 | 7 | 8 | 9 |


In [60]:
# create a dataframe from a dictionary of dictionaries (inner keys become row labels)
d1 = {'Price':{'r1':20, 'r2':56}, 'Sales':{'r1':500, 'r2':600}}

In [61]:
df1 = pd.DataFrame(d1)
df1

| | Price | Sales |
|---|---|---|
| r1 | 20 | 500 |
| r2 | 56 | 600 |


In [62]:
type(df1['Price'])

pandas.core.series.Series

In [63]:
print(df1['Price'])

r1    20
r2    56
Name: Price, dtype: int64


In [64]:
df1['Price']['r1']

20

In [65]:
df1['Price']['r2']

56

In [67]:
# create a dataframe from a dictionary of Series
d = {'Price':pd.Series([5, 10, 13, 26, 43]), 'Sales':pd.Series([10,30,50,45,42])}

In [68]:
df2 = pd.DataFrame(d)
df2

| | Price | Sales |
|---|---|---|
| 0 | 5 | 10 |
| 1 | 10 | 30 |
| 2 | 13 | 50 |
| 3 | 26 | 45 |
| 4 | 43 | 42 |


In [71]:
s1 = pd.Series([5, 10, 13, 26, 43])
s2 = pd.Series([10, 30, 50, 45, 42])
df = pd.DataFrame({'Price':s1, 'Sales':s2})
df

| | Price | Sales |
|---|---|---|
| 0 | 5 | 10 |
| 1 | 10 | 30 |
| 2 | 13 | 50 |
| 3 | 26 | 45 |
| 4 | 43 | 42 |


In [73]:
# create a dataframe from a list of dictionaries; the keys become the column names
l1 = [{'Price':20,'Sales':500,'Qty':25},{'Price':30,'Sales':600,'Qty':20}]

In [74]:
df1 = pd.DataFrame(l1)
df1

| | Price | Sales | Qty |
|---|---|---|---|
| 0 | 20 | 500 | 25 |
| 1 | 30 | 600 | 20 |
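
The dictionaries in such a list do not need identical keys: a key missing from one dictionary shows up as NaN in that row. The sketch below uses a made-up list named `records` to show this, together with a few common ways of inspecting and handling missing values.

```python
import pandas as pd

# hypothetical records; the second one has no 'Qty' key
records = [{'Price': 20, 'Sales': 500, 'Qty': 25},
           {'Price': 30, 'Sales': 600}]
sales = pd.DataFrame(records)
print(sales)

# count missing values per column
print(sales.isna().sum())

# replace missing values in 'Qty' with a default
filled = sales.fillna({'Qty': 0})
print(filled)

# or drop every row that contains a missing value
complete = sales.dropna()
print(complete)
```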


In [None]:
# accessing rows and columns

In [69]:
# access column
df1['Sales']

0    500
1    600
Name: Sales, dtype: int64

In [70]:
# attribute-style access; works only when the column name is a valid Python identifier
df1.Sales

0    500
1    600
Name: Sales, dtype: int64
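
Both forms above return a Series. Passing a list of column names instead returns a DataFrame, even when the list holds a single name. A sketch with a made-up frame named `orders`:

```python
import pandas as pd

orders = pd.DataFrame({'Price': [20, 30], 'Sales': [500, 600]})

one_column = orders['Sales']    # single name: a Series
as_frame = orders[['Sales']]    # list of names: a DataFrame with one column

print(type(one_column))
print(type(as_frame))
print(as_frame.shape)
```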

#### rename the column names

In [None]:
# rename a column; the general form is
# df.rename(columns={'old name': 'new name'})

In [78]:
# inplace=True applies the change to the original dataframe instead of returning a new one
df1.rename(columns={'Qty':'Quantity'}, inplace=True)

In [79]:
df1

| | Price | Sales | Quantity |
|---|---|---|---|
| 0 | 20 | 500 | 25 |
| 1 | 30 | 600 | 20 |


In [83]:
# without inplace=True, rename returns a new dataframe and df1 is left unchanged
df2 = df1.rename(columns={'Sales':'sales'})

In [85]:
df2

| | Price | sales | Quantity |
|---|---|---|---|
| 0 | 20 | 500 | 25 |
| 1 | 30 | 600 | 20 |


#### Methods
- head()
- tail()
- describe()
- info()

In [86]:
# get the first n rows (n defaults to 5)
df1.head()

| | Price | Sales | Quantity |
|---|---|---|---|
| 0 | 20 | 500 | 25 |
| 1 | 30 | 600 | 20 |


In [87]:
# get the last n rows (n defaults to 5)
df1.tail()

| | Price | Sales | Quantity |
|---|---|---|---|
| 0 | 20 | 500 | 25 |
| 1 | 30 | 600 | 20 |


In [88]:
# dataframe data summary
df1.describe()

| | Price | Sales | Quantity |
|---|---|---|---|
| count | 2.0 | 2.0 | 2.0 |
| mean | 25.0 | 550.0 | 22.5 |
| std | 7.071068 | 70.710678 | 3.535534 |
| min | 20.0 | 500.0 | 20.0 |
| 25% | 22.5 | 525.0 | 21.25 |
| 50% | 25.0 | 550.0 | 22.5 |
| 75% | 27.5 | 575.0 | 23.75 |
| max | 30.0 | 600.0 | 25.0 |


In [89]:
df1['Sales'].describe()

count      2.000000
mean     550.000000
std       70.710678
min      500.000000
25%      525.000000
50%      550.000000
75%      575.000000
max      600.000000
Name: Sales, dtype: float64

In [90]:
# dataframe information: column dtypes, non-null counts, memory usage
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Price     2 non-null      int64
 1   Sales     2 non-null      int64
 2   Quantity  2 non-null      int64
dtypes: int64(3)
memory usage: 176.0 bytes


##### loc (label-based) and iloc (integer-position-based)
loc selects rows and columns with specific labels. iloc selects rows and columns at specific integer positions.
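
One difference worth seeing in code is how the two accessors treat slices: loc includes the end label, while iloc excludes the end position, as ordinary Python slicing does. A sketch with a made-up frame named `stock`:

```python
import pandas as pd

stock = pd.DataFrame({'Price': [5, 10, 13, 26], 'Sales': [10, 30, 50, 45]},
                     index=list('abcd'))

# label slice: rows 'a' through 'c', end label included
by_label = stock.loc['a':'c', 'Price']

# position slice: rows 0 and 1 only, end position excluded
by_position = stock.iloc[0:2, 0]

print(by_label)
print(by_position)
```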

In [91]:
df1

| | Price | Sales | Quantity |
|---|---|---|---|
| 0 | 20 | 500 | 25 |
| 1 | 30 | 600 | 20 |


In [92]:
# loc
type(df1.loc[1])

pandas.core.series.Series

In [95]:
dict(df1.loc[1])

{'Price': 30, 'Sales': 600, 'Quantity': 20}

In [96]:
df1

| | Price | Sales | Quantity |
|---|---|---|---|
| 0 | 20 | 500 | 25 |
| 1 | 30 | 600 | 20 |


In [97]:
# extract the 1st value of the Price column
# loc with a row label and a column label in a single call
df1.loc[0, 'Price']

20

In [98]:
# loc
df1.loc[0, 'Sales']

500

In [99]:
# extract the 2nd value of the Price column
# loc
df1.loc[1, 'Price']

30

In [100]:
df1 = pd.DataFrame({'Price':[5, 10, 13, 26, 43], 
                    'Sales':[10,30,50,45,42]},
                   index=list('abcde'))

In [101]:
df1

| | Price | Sales |
|---|---|---|
| a | 5 | 10 |
| b | 10 | 30 |
| c | 13 | 50 |
| d | 26 | 45 |
| e | 43 | 42 |


In [102]:
# extract the 1st row by its label; a single label returns a Series
a = df1.loc['a']
print(type(a))
a

<class 'pandas.core.series.Series'>


Price     5
Sales    10
Name: a, dtype: int64

In [103]:
# a list of labels returns a DataFrame, even for a single row
d = df1.loc[['c']]
print(type(d))
d

<class 'pandas.core.frame.DataFrame'>


| | Price | Sales |
|---|---|---|
| c | 13 | 50 |


In [None]:
df1['Price']

In [104]:
# extract the 1st and 3rd row values of the Price column
df1.loc[['a','c'], 'Price']

a     5
c    13
Name: Price, dtype: int64

In [105]:
df1

| | Price | Sales |
|---|---|---|
| a | 5 | 10 |
| b | 10 | 30 |
| c | 13 | 50 |
| d | 26 | 45 |
| e | 43 | 42 |


In [106]:
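# iloc: row position 0, column position 0
df1.iloc[0, 0]

5

In [107]:
# iloc: row position 2, column position 1
df1.iloc[2,1]

50

In [114]:
# iloc: rows at positions 2 and 4
df1.iloc[[2, 4]]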

| | Price | Sales |
|---|---|---|
| c | 13 | 50 |
| e | 43 | 42 |


In [117]:
# read csv file into dataframe
df = pd.read_csv('medal.csv')

In [118]:
df.shape

(29216, 10)

In [119]:
df.head()

| | City | Edition | Sport | Discipline | Athlete | NOC | Gender | Event | Event_gender | Medal |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Athens | 1896 | Aquatics | Swimming | HAJOS, Alfred | HUN | Men | 100m freestyle | M | Gold |
| 1 | Athens | 1896 | Aquatics | Swimming | HERSCHMANN, Otto | AUT | Men | 100m freestyle | M | Silver |
| 2 | Athens | 1896 | Aquatics | Swimming | DRIVAS, Dimitrios | GRE | Men | 100m freestyle for sailors | M | Bronze |
| 3 | Athens | 1896 | Aquatics | Swimming | MALOKINIS, Ioannis | GRE | Men | 100m freestyle for sailors | M | Gold |
| 4 | Athens | 1896 | Aquatics | Swimming | CHASAPIS, Spiridon | GRE | Men | 100m freestyle for sailors | M | Silver |


In [120]:
# encoding='latin' (an alias for latin-1) is passed for files that are not valid UTF-8
store = pd.read_csv("store.csv", encoding='latin')

In [124]:
df.tail(4)

| | City | Edition | Sport | Discipline | Athlete | NOC | Gender | Event | Event_gender | Medal |
|---|---|---|---|---|---|---|---|---|---|---|
| 29212 | Beijing | 2008 | Wrestling | Wrestling Gre-R | MIZGAITIS, Mindaugas | LTU | Men | 96 - 120kg | M | Bronze |
| 29213 | Beijing | 2008 | Wrestling | Wrestling Gre-R | PATRIKEEV, Yuri | ARM | Men | 96 - 120kg | M | Bronze |
| 29214 | Beijing | 2008 | Wrestling | Wrestling Gre-R | LOPEZ, Mijain | CUB | Men | 96 - 120kg | M | Gold |
| 29215 | Beijing | 2008 | Wrestling | Wrestling Gre-R | BAROEV, Khasan | RUS | Men | 96 - 120kg | M | Silver |


In [125]:
# check the number of rows and columns 
df.shape

(29216, 10)

In [128]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29216 entries, 0 to 29215
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   City          29216 non-null  object
 1   Edition       29216 non-null  int64 
 2   Sport         29216 non-null  object
 3   Discipline    29216 non-null  object
 4   Athlete       29216 non-null  object
 5   NOC           29216 non-null  object
 6   Gender        29216 non-null  object
 7   Event         29216 non-null  object
 8   Event_gender  29216 non-null  object
 9   Medal         29216 non-null  object
dtypes: int64(1), object(9)
memory usage: 2.2+ MB


In [126]:
df.describe()

| | Edition |
|---|---|
| count | 29216.0 |
| mean | 1967.713171 |
| std | 32.406293 |
| min | 1896.0 |
| 25% | 1948.0 |
| 50% | 1976.0 |
| 75% | 1996.0 |
| max | 2008.0 |


In [130]:
df['City'].describe()

count      29216
unique        22
top       Athens
freq        2149
Name: City, dtype: object

In [131]:
df['Gender'].describe()

count     29216
unique        2
top         Men
freq      21721
Name: Gender, dtype: object
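
For a text column, describe() reports the number of distinct values (unique), the most frequent value (top), and how often it occurs (freq). The same figures can be derived by hand; the sketch below uses a small made-up Series named `medals` rather than the medal.csv data.

```python
import pandas as pd

medals = pd.Series(['Gold', 'Silver', 'Gold', 'Bronze', 'Gold'])

# occurrences per distinct value, most frequent first
counts = medals.value_counts()
print(counts)

print(medals.nunique())   # corresponds to 'unique' in describe()
print(counts.index[0])    # corresponds to 'top'
print(counts.iloc[0])     # corresponds to 'freq'

summary = medals.describe()
print(summary)
```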

In [133]:
# check the column list of a dataframe
df.columns

Index(['City', 'Edition', 'Sport', 'Discipline', 'Athlete', 'NOC', 'Gender',
       'Event', 'Event_gender', 'Medal'],
      dtype='object')

In [134]:
# fetch dataframe specific columns
df[['City', 'Edition', 'Sport']].head()

| | City | Edition | Sport |
|---|---|---|---|
| 0 | Athens | 1896 | Aquatics |
| 1 | Athens | 1896 | Aquatics |
| 2 | Athens | 1896 | Aquatics |
| 3 | Athens | 1896 | Aquatics |
| 4 | Athens | 1896 | Aquatics |
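
The DataFrame topic list earlier mentions iterating over rows and columns, which none of the cells demonstrate. A minimal sketch, using a made-up frame named `games` rather than the medal.csv data:

```python
import pandas as pd

games = pd.DataFrame({'City': ['Athens', 'Beijing'], 'Edition': [1896, 2008]})

# iterate over rows: each step yields the index label and the row as a Series
for label, row in games.iterrows():
    print(label, row['City'], row['Edition'])

# iterate over columns: each step yields the column name and the column as a Series
for name, column in games.items():
    print(name, column.tolist())
```

Row-by-row loops are slow on large frames, so vectorized column operations are usually preferable where they fit.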
