# PANDAS COMPLETE TUTORIAL

## CH01 INTRODUCTION

In [1]:
intro="""
Pandas is a Python library used for working with data sets.

Pandas is an open-source Python library providing high-performance data manipulation and 
analysis tools built on its powerful data structures.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library
was created by Wes McKinney in 2008.

Pandas allows us to analyze large data sets and draw conclusions based on statistical theory.

Pandas can clean messy data sets, and make them readable and relevant.

Relevant data is very important in data science.

Pandas helps answer questions about the data, such as:

Is there a correlation between two or more columns?
What is the average value?
What is the maximum value?
What is the minimum value?

Pandas can also delete rows that are not relevant or that contain wrong values, such as empty or 
NULL values. This is called cleaning the data.

The source code for Pandas is hosted on GitHub: https://github.com/pandas-dev/pandas

Pandas provides two main classes for handling data:

Series: a one-dimensional labeled array holding data of any type
such as integers, strings, Python objects etc.

DataFrame: a two-dimensional data structure that holds data like a two-dimensional array or 
a table with rows and columns.
"""

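As a rough illustration of the two structures described above, the sketch below builds a Series and a DataFrame and asks the kinds of questions listed in the introduction. All values, the row labels, and the column names `height_cm` and `weight_kg` are invented for this example.

```python
import pandas as pd

# A Series: one-dimensional labeled data (values invented for illustration)
ages = pd.Series([23, 35, 41], index=["ann", "bob", "cid"])

# A DataFrame: a table with labeled rows and columns (hypothetical columns)
df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180],
    "weight_kg": [50, 60, 70, 80],
})

print(ages.mean())            # average value of the Series
print(df["height_cm"].max())  # maximum of one column
print(df["weight_kg"].min())  # minimum of one column
print(df.corr())              # pairwise correlation between the columns
```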
In [None]:
#FEATURES 
"""
Fast and efficient DataFrame object with default and customized indexing.
Tools for loading data into in-memory data objects from different file formats.
Data alignment and integrated handling of missing data.
Reshaping and pivoting of data sets.
Label-based slicing, indexing and subsetting of large data sets.
Columns from a data structure can be deleted or inserted.
Group by data for aggregation and transformations.
High performance merging and joining of data.
Time Series functionality.

To install pandas:

pip install pandas

To import pandas:

import pandas
or
import pandas as pd (the conventional alias)
"""

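A few of the features listed above can be sketched in a handful of lines. The example below uses invented data and hypothetical column names (`city`, `sales`, `region`, `sales_k`) to show group-by aggregation, a merge, and inserting and deleting a column. It is only an illustration, not a complete tour of the library.

```python
import pandas as pd

# Invented sales data; the column names are assumptions for this sketch
sales = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Delhi"],
    "sales": [100, 200, 150, 50],
})
regions = pd.DataFrame({
    "city": ["Pune", "Delhi"],
    "region": ["West", "North"],
})

# Group by a column and aggregate
totals = sales.groupby("city")["sales"].sum()
print(totals)

# Merge (join) two tables on a shared column
merged = sales.merge(regions, on="city", how="left")
print(merged)

# Insert a column, then delete it again
merged["sales_k"] = merged["sales"] / 1000
merged = merged.drop(columns=["sales_k"])
print(merged.columns.tolist())
```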
## CH02 WORKING WITH PANDAS SERIES 

In [7]:
#working with series
import pandas as pd 

l1=[21,22,23,24,25,65,43]

ser=pd.Series(l1)
print("the pandas series is ")
print(ser)
print("the series type is = ",type(ser))

the pandas series is 
0    21
1    22
2    23
3    24
4    25
5    65
6    43
dtype: int64
the series type is =  <class 'pandas.core.series.Series'>


In [8]:
#creating another series 

numbers=[25,76,45,34,45]
index=['a','b','c','d','e']
ser=pd.Series(numbers,index=index)
print(ser)

a    25
b    76
c    45
d    34
e    45
dtype: int64


In [None]:
"""
Attributes and Methods:
series.values: Returns the values of the Series as a NumPy array.
series.index: Returns the index of the Series.
series.dtype: Returns the data type of the values.
series.head(n): Returns the first n elements.
series.tail(n): Returns the last n elements.
series.describe(): Generates descriptive statistics.
"""

In [14]:

numbers=[25,76,45,34,45,55,67,89,56,43]
index=['a','b','c','d','e','f','g','h','i','j']
ser=pd.Series(numbers,index=index)

print("data series are = ",ser.values)
print("data index are = ",ser.index)
print("data type is = ",ser.dtype)

data series are =  [25 76 45 34 45 55 67 89 56 43]
data index are =  Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')
data type is =  int64


In [15]:
ser.head(3)

a    25
b    76
c    45
dtype: int64

In [16]:
ser.head() #it will show first 5 records 

a    25
b    76
c    45
d    34
e    45
dtype: int64

In [17]:
n=3
ser.tail(n) 

h    89
i    56
j    43
dtype: int64

In [18]:
ser.tail() #it will show last 5 records 

f    55
g    67
h    89
i    56
j    43
dtype: int64

In [19]:
ser.describe() #summary statistics of the series

count    10.000000
mean     53.500000
std      19.449364
min      25.000000
25%      43.500000
50%      50.000000
75%      64.250000
max      89.000000
dtype: float64
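
The rows reported by `describe()` can also be computed one at a time. The sketch below rebuilds the same ten-element series and reads a few of those values directly; `summary` is just a name chosen for this example.

```python
import pandas as pd

numbers = [25, 76, 45, 34, 45, 55, 67, 89, 56, 43]
ser = pd.Series(numbers, index=list("abcdefghij"))

summary = ser.describe()     # the result is itself a Series, indexed by statistic name
print(summary["mean"])       # same value as ser.mean()
print(ser.quantile(0.25))    # the "25%" row
print(ser.median())          # the "50%" row
```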

In [None]:
#working with slicing and indexing on pandas series 

In [30]:
import pandas as pd 
numbers=[25,76,45,34,45,55,67,89,56,43,55,67,89,]
index=['a','b','c','d','e','f','g','h','i','j',"k","l","m"]

ser=pd.Series(numbers,index=index)

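#position-based indexing
#plain ser[0] on a labeled index relies on a positional fallback that newer pandas versions deprecate, so .iloc is used
#print("the pandas series are = ",ser)
print("the 0th element is = ",ser.iloc[0])

print("the last element is = ",ser.iloc[len(ser)-1])
print("the last element is = ",ser.iloc[-1])
print("the last element is = ",ser.iloc[12])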

#Label-based Indexing:
print("the 0th element is = ",ser['a'])
print("the last element is = ",ser['m'])
print("the second element is = ",ser['b'])

the 0th element is =  25
the last element is =  89
the last element is =  89
the last element is =  89
the 0th element is =  25
the last element is =  89
the second element is =  76
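
When the index itself holds integers, square brackets look the key up as a label, not as a position, which is one reason to prefer the explicit accessors. A small sketch with made-up values:

```python
import pandas as pd

# Index labels are integers that do not match the positions
s = pd.Series([100, 200, 300], index=[10, 20, 30])

print(s[10])        # label lookup: the value stored under label 10
print(s.loc[10])    # explicit label-based access, same result
print(s.iloc[0])    # explicit position-based access: first element
print(s.iloc[-1])   # last element by position

# s[0] would raise KeyError here, because no label 0 exists
```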


In [36]:
#slicing series[start:stop:step]
print(ser["a":"d"]) #label slice: "d" is included, default step is 1
print(ser[0:3]) #position slice: 3 is excluded, default step is 1


a    25
b    76
c    45
d    34
dtype: int64
a    25
b    76
c    45
dtype: int64


In [37]:
print(ser["a":"d":2]) #label slice: "d" is included, step is 2
print(ser[0:3:2]) #position slice: 3 is excluded, step is 2

a    25
c    45
dtype: int64
a    25
c    45
dtype: int64


In [39]:
#boolean indexing 
bool_data=ser>=50 #it will create boolean series
print(bool_data)
print(ser[bool_data]) #prints the values greater than or equal to 50

a    False
b     True
c    False
d    False
e    False
f     True
g     True
h     True
i     True
j    False
k     True
l     True
m     True
dtype: bool
b    76
f    55
g    67
h    89
i    56
k    55
l    67
m    89
dtype: int64


In [40]:
print(ser[ser>70])

b    76
h    89
m    89
dtype: int64
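
Boolean masks can also be combined. The sketch below reuses the same thirteen values; the variable names `between`, `extremes`, and `not_large` are chosen only for this example.

```python
import pandas as pd

numbers = [25, 76, 45, 34, 45, 55, 67, 89, 56, 43, 55, 67, 89]
ser = pd.Series(numbers, index=list("abcdefghijklm"))

# Each condition needs its own parentheses; use & (and), | (or), ~ (not)
between = ser[(ser >= 50) & (ser <= 70)]
extremes = ser[(ser < 30) | (ser > 80)]
not_large = ser[~(ser >= 50)]

print(between)
print(extremes)
print(not_large)
```

Python's `and`, `or`, and `not` keywords do not work element-wise on a Series, which is why the operator forms are used.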


In [41]:
#fancy indexing 
print(ser[["a","c","d","e"]])

a    25
c    45
d    34
e    45
dtype: int64


In [43]:
#fancy indexing: labels can be given in any order and repeated
print(ser[['m',"a","c","d","e",'e','e']])

m    89
a    25
c    45
d    34
e    45
e    45
e    45
dtype: int64
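
Selecting with a list of labels raises a `KeyError` if any label is absent. As a sketch of one alternative, `reindex` returns `NaN` for labels it cannot find. The label `"z"` below is deliberately missing, and the three values are a shortened stand-in for the series above.

```python
import pandas as pd

ser = pd.Series([25, 76, 45], index=["a", "b", "c"])

try:
    ser[["a", "z"]]          # "z" is not in the index
except KeyError as err:
    print("KeyError:", err)

# reindex tolerates missing labels and fills them with NaN
picked = ser.reindex(["c", "a", "z"])
print(picked)
```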


In [48]:
# Indexing with iloc and loc:
import pandas as pd 
numbers=[25,76,45,34,45,55,67,89,56,43,55,67,89,]
index=['a','b','c','d','e','f','g','h','i','j',"k","l","m"]

data_ser=pd.Series(numbers,index=index)
print("0th value is =",data_ser.iloc[0]) #position-based (implicit) index
print("0th value is =",data_ser.loc['a']) #label-based (explicit) index

print("last value is =",data_ser.iloc[12])
print("last value is =",data_ser.loc['m'])


0th value is = 25
0th value is = 25
last value is = 89
last value is = 89
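
Both accessors also take slices and lists. A short sketch on the same data, showing that a `loc` slice includes its end label while an `iloc` slice excludes its end position:

```python
import pandas as pd

numbers = [25, 76, 45, 34, 45, 55, 67, 89, 56, 43, 55, 67, 89]
data_ser = pd.Series(numbers, index=list("abcdefghijklm"))

by_label = data_ser.loc["a":"d"]       # label slice: end label "d" is included
by_position = data_ser.iloc[0:3]       # position slice: end position 3 is excluded
picked = data_ser.iloc[[0, 2, -1]]     # list of positions

print(by_label)
print(by_position)
print(picked)
```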


In [52]:
#Setting and Resetting Index:

import pandas as pd 
numbers=[25,76,45,34,45,55,67,89,56,43,55,67,89,]
index=['a','b','c','d','e','f','g','h','i','j',"k","l","m"]

s=pd.Series(numbers,index=index)
print(s)
s.reset_index(drop=True, inplace=True) #reset to the default 0..n-1 index; drop=True discards the old labels
print(s)

a    25
b    76
c    45
d    34
e    45
f    55
g    67
h    89
i    56
j    43
k    55
l    67
m    89
dtype: int64
0     25
1     76
2     45
3     34
4     45
5     55
6     67
7     89
8     56
9     43
10    55
11    67
12    89
dtype: int64
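
The cell above only resets the index. As a sketch of the "setting" half, new labels can be assigned to the `index` attribute, and calling `reset_index` without `inplace=True` returns a new object while leaving the original untouched. The labels and the shortened data below are arbitrary.

```python
import pandas as pd

s = pd.Series([25, 76, 45])

# Set a new index by assigning labels (the length must match the data)
s.index = ["x", "y", "z"]
print(s)

# Non-inplace reset: returns a new Series with a default 0..n-1 index
s_reset = s.reset_index(drop=True)
print(s_reset)

# Without drop=True the old labels are kept as a column, giving a DataFrame
as_frame = s.reset_index()
print(type(as_frame))
```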


In [60]:
#arithmetic operations on pandas series 
import pandas as pd

series1 = pd.Series([4, 13, 20, 16])
series2 = pd.Series([5, 6, 7, 8])

# Addition
result = series1 + series2
print("the addition result\n",result)

# Subtraction
result = series1 - series2
print("the subtraction result\n",result)
# Multiplication
result = series1 * series2
print("the multiplication result\n",result)

# Division
result = series1 / series2
print("the division result\n",result)

# floor Division
result = series1 // series2
print("the floor division result\n",result)


the addition result
 0     9
1    19
2    27
3    24
dtype: int64
the subtraction result
 0    -1
1     7
2    13
3     8
dtype: int64
the multiplication result
 0     20
1     78
2    140
3    128
dtype: int64
the division result
 0    0.800000
1    2.166667
2    2.857143
3    2.000000
dtype: float64
the floor division result
 0    0
1    2
2    2
3    2
dtype: int64


In [61]:
# modulo (remainder)
result = series1 % series2
print("the modulo result\n",result)

the modulo result
 0    4
1    1
2    6
3    0
dtype: int64


In [62]:
# exponentiation
result = series1 ** series2
print("the exponentiation result\n",result)

the exponentiation result
 0          1024
1       4826809
2    1280000000
3    4294967296
dtype: int64
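
The operations above pair up elements by index label, not by position. In the cells above both series share the default index, so the distinction is invisible. The sketch below uses two small invented series with partly different labels to show the alignment.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Labels present in only one Series produce NaN
aligned = s1 + s2
print(aligned)

# The add() method accepts fill_value to treat missing entries as 0
total = s1.add(s2, fill_value=0)
print(total)
```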


In [68]:
#aggregate function 
#mean(), median(), std(), sum(), min(), max()
import pandas as pd

series = pd.Series([2, 8, 16, 4])
# Sum
total = series.sum()
print("the total of the series = ",total)
# Mean
average = series.mean()
print("the avg of the series = ",average)
# Maximum
maximum_value = series.max()
print("the maximum value of the series = ",maximum_value)
# Minimum
minimum_value = series.min()
print("the min value of the series = ",minimum_value)

# Median
median_value = series.median()
print("the median value of the series = ",median_value)

# Standard deviation (sample standard deviation, ddof=1 by default)
std_value = series.std()
print("the std value of the series = ",std_value)

the total of the series =  30
the avg of the series =  7.5
the maximum value of the series =  16
the min value of the series =  2
the median value of the series =  6.0
the std value of the series =  6.191391873668904
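
Several aggregates can also be requested in one call. A minimal sketch using `agg` with a list of function names on the same four values; `stats` is a name chosen for this example.

```python
import pandas as pd

series = pd.Series([2, 8, 16, 4])

# agg() accepts a list of function names and returns one value per name
stats = series.agg(["sum", "mean", "min", "max", "median"])
print(stats)
```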


In [71]:
"""
Handling Missing Data:
Methods like dropna() and fillna(value) are used to handle missing data.
"""
import numpy as np 
import pandas as pd 

data=pd.Series([21,23,55,67,np.nan,99,23,np.nan,np.nan])
print(data)

new_data=data.dropna() #drop the missing values
print(new_data)

new_data=data.fillna(data.max()) #replace missing values with the maximum (99)
print(new_data)

0    21.0
1    23.0
2    55.0
3    67.0
4     NaN
5    99.0
6    23.0
7     NaN
8     NaN
dtype: float64
0    21.0
1    23.0
2    55.0
3    67.0
5    99.0
6    23.0
dtype: float64
0    21.0
1    23.0
2    55.0
3    67.0
4    99.0
5    99.0
6    23.0
7    99.0
8    99.0
dtype: float64


In [72]:

new_data=data.fillna(22) #replace missing values with a constant
print(new_data)

0    21.0
1    23.0
2    55.0
3    67.0
4    22.0
5    99.0
6    23.0
7    22.0
8    22.0
dtype: float64
