<font size=7><b> Pandas

<font size=6> *What is Pandas?*

<font size=4>-Pandas is a Python library used for working with data sets.
    

<font size=4>-It has functions for analyzing, cleaning, exploring, and manipulating data.

<font size=4>-The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

<font size=6> *Why Use Pandas?*

<font size=4>-Pandas allows us to analyze large data sets and draw conclusions based on statistical theory.

<font size=4>-Pandas can clean messy data sets, and make them readable and relevant.

<font size=4>-Relevant data is very important in data science.

<font size=6>What Can Pandas Do?

<font size=4>Pandas gives you answers about the data, such as:


<font size=4>-Is there a correlation between two or more columns?
    
<font size=4>-What is the average value?
    
<font size=4>-What is the maximum value?
    
<font size=4>-What is the minimum value?
    
<font size=4>Pandas can also delete rows that are not relevant or that contain wrong values, such as empty or NULL values. This is called cleaning the data.
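
<font size=4>As a rough sketch of how these questions translate into code, the example below builds a small made-up table and asks each question in turn. The column names `hours` and `score` and all values are invented for illustration, and the snippet assumes Pandas is already installed.

```python
import pandas as pd

# Hypothetical data, invented for this illustration only
df = pd.DataFrame({
    "hours": [1.0, 2.0, 3.0, 4.0, None],
    "score": [52.0, 58.0, 65.0, 71.0, 80.0],
})

# Cleaning: drop rows that contain empty (NaN) values
clean = df.dropna()

# Is there a correlation between the two columns?
correlation = clean["hours"].corr(clean["score"])

# Average, maximum, and minimum of one column
average = clean["score"].mean()
highest = clean["score"].max()
lowest = clean["score"].min()

print(correlation, average, highest, lowest)
```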

<font size=6> Installation of Pandas

In [1]:
! pip install pandas



<font size=6>Import Pandas

<font size=4>Once Pandas is installed, import it into your application with the **import** keyword:

In [2]:
import pandas

<font size=4>Now Pandas is imported and ready to use.

In [3]:
import pandas

mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}

myvar = pandas.DataFrame(mydataset)

print(myvar)

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


<font size=6>Pandas as pd

<font size=4>Pandas is usually imported under the **pd** alias.

<font size=4>Create an alias with the **as** keyword while importing:

In [1]:
import pandas as pd

<font size=4>Now the Pandas package can be referred to as **pd** instead of **pandas**.

In [5]:
import pandas as pd

mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}

myvar = pd.DataFrame(mydataset)

print(myvar)

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


<font size=7>**Pandas Series**

<font size=6>What is a Series?

<font size=4>-A Pandas Series is like a column in a table.

<font size=4>-It is a one-dimensional array holding data of any type.

In [2]:
import numpy as np
import pandas as pd
Series1=pd.Series(np.random.randn(4))
print(Series1)
print(Series1.index)

0   -0.268666
1    0.244631
2   -0.557320
3    1.149906
dtype: float64
RangeIndex(start=0, stop=4, step=1)


<font size=6>*Series from lists*

In [3]:
# string
a = ['a','b','c','d','e']

pd.Series(a)

0    a
1    b
2    c
3    d
4    e
dtype: object

In [4]:
# integers
b = [13,24,56,78,100]

b_series= pd.Series(b)
print(b_series)

0     13
1     24
2     56
3     78
4    100
dtype: int64


<font size=6>*Creating a Series from Dictionary*

In [5]:
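# Avoid naming the variable "dict", which would shadow the built-in type
date_parts={'m':1,'y':2024,'d':'sunday'}

Series3=pd.Series(date_parts)
print(Series3)

# The index argument selects the keys and sets their order
Series4=pd.Series(date_parts,index=['y','m','d'])
print(Series4)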

m         1
y      2024
d    sunday
dtype: object
y      2024
m         1
d    sunday
dtype: object
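
<font size=4>The **index** argument can also leave keys out or ask for labels that the dictionary does not contain. The sketch below is a small illustration: the label `'h'` and the variable names are made up for the example.

```python
import pandas as pd

date_parts = {'m': 1, 'y': 2024, 'd': 'sunday'}

# 'h' is not a key in the dictionary, so its value is missing (NaN);
# 'd' is a key but is not requested, so it is left out
partial = pd.Series(date_parts, index=['y', 'm', 'h'])
print(partial)
```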


<font size=6>*Labels*

<font size=4>-If nothing else is specified, the values are labeled with their index number. The first value has index 0, the second value has index 1, and so on.


<font size=4>-This label can be used to access a specified value.

In [6]:
print(Series1[0])

-0.2686658382443255


<font size=6>*Create Labels*

<font size=4> With the **index** argument, you can name your own labels.

In [7]:
Series1=pd.Series(np.random.randn(4),index=['a','b','c','d'])
print(Series1)
print(Series1.index)

a    1.273929
b    0.650178
c   -0.189962
d   -1.775883
dtype: float64
Index(['a', 'b', 'c', 'd'], dtype='object')


In [8]:
print(Series1["b"])

0.6501775707458912


<font size=6>*Creating a Series from a Scalar Value*

In [9]:
print('creating series from scalar value')
scl=pd.Series(8,index=['a','b','c','d'])
print(scl)

creating series from scalar value
a    8
b    8
c    8
d    8
dtype: int64


<font size=6>Scalar Value without an Index

In [10]:
print('creating series from scalar value')
scl=pd.Series(8)
print(scl)

creating series from scalar value
0    8
dtype: int64


<font size=6>*Descriptive Statistics in a Series*

In [11]:
print('Series1')
print(Series1)

# Sample operations on a Series
print('\n mean:',Series1.mean())  # Calculates the mean of the values in the Series.

print('\n median:',Series1.median())  # Finds the middle value of the Series when the values are sorted.

print('\n maximum:',Series1.max())  # Returns the maximum value in the Series.

print('\n minimum:',Series1.min())  # Returns the minimum value in the Series.

# Calculates the sample standard deviation (ddof=1 by default), which measures the dispersion of the values.
print('\n standard deviation:',Series1.std())

print('\n sum:',Series1.sum())  # Computes the total sum of all values in the Series.

Series2=pd.Series([1,10,5,17,9])
print('\n Series2')
print(Series2)

print('\n sum:',Series2.sum())

# Sorts the values of the Series in ascending order and returns a new Series
# in which each value keeps its original index label.
# The original Series remains unchanged.
print('\n sorted values:')
print( Series2.sort_values())

# Sorts the Series based on its index. Since the default indices are already in order,
# it returns the Series in its original order.
print('\n sorted indexes:')
print( Series2.sort_index())

Series1
a    1.273929
b    0.650178
c   -0.189962
d   -1.775883
dtype: float64

 mean: -0.010434512729797396

 median: 0.2301077891537096

 maximum: 1.273929403417992

 minimum: -1.775883032644601

 standard deviation: 1.3209892235715492

 sum: -0.041738050919189584

 Series2
0     1
1    10
2     5
3    17
4     9
dtype: int64

 sum: 42

 sorted values:
0     1
2     5
4     9
1    10
3    17
dtype: int64

 sorted indexes:
0     1
1    10
2     5
3    17
4     9
dtype: int64


<font size=6>*Mathematical Operations on Series*

In [37]:
# Create a Series
s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
print("Original Series:")
print(s1)

# Addition
print("\nAddition:")
print(s1 + s1)

# Addition with non-matched labels
print('\nAddition with non-matched labels:')
print(s1[1:] + s1[:-1])

# Multiplication
print('\nMultiplication:')
print(s1 * s1)

# More Mathematical Operations
print("\nMore Mathematical Operations:")

# Subtraction
print("Subtraction (s1 - s1):")
print(s1 - s1)

# Division
print("\nDivision (s1 / 2):")
print(s1 / 2)

# Exponentiation
print("\nExponentiation (s1 ** 2):")
print(s1 ** 2)

# Square Root
print("\nSquare Root (s1.apply(np.sqrt)):")
print(s1.apply(np.sqrt))

# Cumulative Sum
# Computes the cumulative sum of the elements in the Series.
# Each element is replaced by the sum of all elements up to and including that position.
print("\nCumulative Sum:")
print(s1.cumsum())

# Cumulative Product
# Computes the cumulative product of the elements in the Series.
print("\nCumulative Product:")
print(s1.cumprod())

Original Series:
a    1
b    2
c    3
d    4
dtype: int64

Addition:
a    2
b    4
c    6
d    8
dtype: int64

Addition with non-matched labels:
a    NaN
b    4.0
c    6.0
d    NaN
dtype: float64

Multiplication:
a     1
b     4
c     9
d    16
dtype: int64

More Mathematical Operations:
Subtraction (s1 - s1):
a    0
b    0
c    0
d    0
dtype: int64

Division (s1 / 2):
a    0.5
b    1.0
c    1.5
d    2.0
dtype: float64

Exponentiation (s1 ** 2):
a     1
b     4
c     9
d    16
dtype: int64

Square Root (s1.apply(np.sqrt)):
a    1.000000
b    1.414214
c    1.732051
d    2.000000
dtype: float64

Cumulative Sum:
a     1
b     3
c     6
d    10
dtype: int64

Cumulative Product:
a     1
b     2
c     6
d    24
dtype: int64
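
<font size=4>The NaN values under "Addition with non-matched labels" appear because arithmetic between two Series is aligned on labels: a label that exists on only one side has no partner, so the result is missing. If treating an absent label as zero is acceptable for your data (an assumption that will not suit every case), `add()` with `fill_value` is one option, sketched here:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

left = s1[1:]    # labels b, c, d
right = s1[:-1]  # labels a, b, c

# The + operator aligns on labels; 'a' and 'd' exist on one side only
plain = left + right

# add() with fill_value=0 treats a label missing on one side as 0
filled = left.add(right, fill_value=0)

print(plain)
print(filled)
```

<font size=4>The methods `sub()`, `mul()`, and `div()` accept the same `fill_value` argument.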


<font size=6>Name Attribute

<font size=4>In Pandas, the **name** attribute is used to assign or retrieve the name of a Series. It gives the Series a meaningful label, which helps with identification and clarity, especially when working with several Series. When a Series becomes a column of a DataFrame, its name is used as the column label.

In [46]:
# Create a Series and assign a name
s2=pd.Series([777,88,65,90],name='studentmarks')
print(s2)

print('\nName of S2',s2.name)

s2=s2.rename('marks')#changing the name of the series

print('\nNew name of S2 :-',s2.name)

0    777
1     88
2     65
3     90
Name: studentmarks, dtype: int64

Name of S2 studentmarks

New name of S2 :- marks


In [49]:
# Create a Series and assign a name
s = pd.Series([1, 2, 3, 4], name='MySeries')
print(s)
print("\nSeries Name:", s.name)
s.name = 'UpdatedSeries'#changing the name of the series
print(s)
print("\nSeries Name:", s.name)

0    1
1    2
2    3
3    4
Name: MySeries, dtype: int64

Series Name: MySeries
0    1
1    2
2    3
3    4
Name: UpdatedSeries, dtype: int64

Series Name: UpdatedSeries
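
<font size=4>One place where the name matters is when a Series is turned into, or combined into, a DataFrame. The following sketch uses made-up Series (`marks`, `ages`) to show the name becoming the column label:

```python
import pandas as pd

marks = pd.Series([77, 88, 65], name='marks')
ages = pd.Series([20, 21, 19], name='age')

# to_frame() turns one Series into a one-column DataFrame
marks_frame = marks.to_frame()

# concat along axis=1 places the Series side by side as columns
combined = pd.concat([marks, ages], axis=1)

print(list(marks_frame.columns))
print(list(combined.columns))
```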


<font size=6>Series Attributes

<font size=5>**index**

* <font size=4>The **index** attribute returns the index (labels) of the Series.
* <font size=4>It can be used to access the labels associated with the data.

In [50]:
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
print(s.index)  # Output: Index(['a', 'b', 'c'], dtype='object')


Index(['a', 'b', 'c'], dtype='object')


<font size=5>**values**

* <font size=4>The **values** attribute returns the underlying data as a NumPy array.
* <font size=4>This can be useful for performing NumPy operations.

In [51]:
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
print(s.values)  # Output: [1 2 3]

[1 2 3]


<font size=5>**name**

* <font size=4>The **name** attribute allows you to get or set the name of the Series.
* <font size=4>It is helpful for identifying the Series, especially when used in a DataFrame.

In [52]:
s.name = 'MySeries'
print(s.name)  # Output: MySeries

MySeries


<font size=5>**dtype**

* <font size=4>The **dtype** attribute returns the data type of the Series elements.
* <font size=4>It provides insight into the types of data stored in the Series.

In [53]:
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
print(s.dtype)  # Output: int64

int64


<font size=5>**size**

* <font size=4>The **size** attribute returns the number of elements in the Series.
* <font size=4>It gives a quick count of how many items are present.

In [54]:
print(s.size)  # Output: 3

3


<font size=5>**shape**

* <font size=4>The **shape** attribute returns a tuple representing the dimensions of the Series (i.e., the number of elements).
* <font size=4>For a one-dimensional Series, it returns a tuple with one element.

In [55]:
print(s.shape)  # Output: (3,)

(3,)


<font size=5>**is_unique**

<font size=4>The **is_unique** attribute in a Pandas Series checks whether all the values in the Series are unique. It returns a Boolean value: True if all values are unique and False if there are any duplicates.

* <font size=4>marks_series.is_unique will return True if all the values in marks_series are distinct.
* <font size=4>If any value appears more than once, it will return False.

In [56]:
# Create a Series with unique values
marks_series = pd.Series([90, 85, 78, 92, 88])
print(marks_series.is_unique)  # Output: True

# Create a Series with duplicate values
duplicate_series = pd.Series([1, 1, 2, 3, 4, 5])
print(duplicate_series.is_unique)  # Output: False


True
False


<font size=5>**isnull()** and **notnull()**

* <font size=4>These methods can be used to check for missing values in the Series.
* <font size=4>**isnull()** returns a Boolean Series indicating where values are missing.
* <font size=4>**notnull()** returns a Boolean Series indicating where values are present.

In [59]:
s_with_nan = pd.Series([1, 2, None, 4])
print(s_with_nan.isnull())  # Output: [False, False, True, False]
print(s_with_nan.notnull())  # Output: [ True,  True, False,  True]

0    False
1    False
2     True
3    False
dtype: bool
0     True
1     True
2    False
3     True
dtype: bool
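
<font size=4>Because the result is a Boolean Series, it can be summed to count missing values or used as a mask. A minimal sketch, reusing the same example values:

```python
import pandas as pd

s_with_nan = pd.Series([1, 2, None, 4])

# True counts as 1 and False as 0, so sum() counts the missing values
missing_count = s_with_nan.isnull().sum()

# Use the notnull() mask to keep only the values that are present
present = s_with_nan[s_with_nan.notnull()]

print(missing_count)
print(present)
```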


<font size=6>Indexing

In [88]:
# Create a Series with custom index
data = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
print(data)


a    10
b    20
c    30
d    40
e    50
dtype: int64


<font size=5>1.Accessing Single Elements

<font size=4>You can access elements in a Series using either labels (index) or integer position:

In [89]:
#Using Label Indexing:
print(data['c'])  # Output: 30

30


In [90]:
#Using Integer Position (.iloc selects by position):
print(data.iloc[2])    # Output: 30

30


<font size=5>2. Accessing Multiple Elements

<font size=4>You can access multiple elements by passing a list of labels or integer positions.

In [91]:
#Using Label Indexing:
print(data[['a', 'c', 'e']])

a    10
c    30
e    50
dtype: int64


In [94]:
#Using Integer Position (Fancy Indexing) with .iloc:
print(data.iloc[[0, 2, 4]])

print()

print(data.iloc[[1, 3]])

print()

print(data.iloc[[1, 2, 3]])

a    10
c    30
e    50
dtype: int64

b    20
d    40
dtype: int64

b    20
c    30
d    40
dtype: int64
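
<font size=4>Plain square brackets have to guess whether a key is a label or a position, which becomes ambiguous when the labels are themselves integers. As far as I know, recent Pandas releases also warn when an integer key is treated as a position on a non-integer index. The explicit accessors `.loc` (labels) and `.iloc` (positions) avoid the guesswork. In this sketch the Series `numbered` is a made-up example:

```python
import pandas as pd

data = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

# .loc always selects by label
by_label = data.loc['c']
several_by_label = data.loc[['a', 'e']]

# .iloc always selects by integer position
by_position = data.iloc[2]
several_by_position = data.iloc[[0, 4]]

# With integer labels the two accessors can return different elements
numbered = pd.Series([10, 20, 30], index=[2, 1, 0])
print(numbered.loc[0])   # label 0 -> 30
print(numbered.iloc[0])  # position 0 -> 10
```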


<font size=6>Slicing

<font size=4>Slicing allows you to extract a subset of the Series.

In [68]:
# Create a Series with custom index
data = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
print(data)

a    10
b    20
c    30
d    40
e    50
dtype: int64


<font size=5>1. Slicing with Labels

<font size=4>You can slice a Series using labels. Unlike positional slicing, a label slice includes both the start and the end label.

In [70]:
print(data['b':'d']) 

b    20
c    30
d    40
dtype: int64


<font size=5>2. Slicing with Integer Positions

<font size=4>You can also slice using integer positions. The end position is exclusive.

In [77]:
print(data[1:4])

print()

print(data[2:5])

print()

print(data[::-1])

print()

print(data[0::2])

b    20
c    30
d    40
dtype: int64

c    30
d    40
e    50
dtype: int64

e    50
d    40
c    30
b    20
a    10
dtype: int64

a    10
c    30
e    50
dtype: int64


<font size=6>Boolean Indexing

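<font size=4>You can also filter a Series based on conditions. A comparison produces a Boolean Series, and indexing with it keeps only the elements where the condition is True.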

In [83]:
# Get elements greater than 25
print(data[data > 25]) 

print()

# Get elements less than 25
print(data[data < 25]) 

print()

# Get elements greater than or equal to 20
print(data[data >=20])

print()

# Get elements less than or equal to 20
print(data[data <=20])

c    30
d    40
e    50
dtype: int64

a    10
b    20
dtype: int64

b    20
c    30
d    40
e    50
dtype: int64

a    10
b    20
dtype: int64
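
<font size=4>Conditions can be combined with `&` (and), `|` (or), and `~` (not). Each condition needs its own parentheses because these operators bind more tightly than the comparisons. A short sketch with the same `data` values:

```python
import pandas as pd

data = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

# AND: values greater than 15 and less than 45
middle = data[(data > 15) & (data < 45)]

# OR: values less than 15 or greater than 45
edges = data[(data < 15) | (data > 45)]

# NOT: invert a condition with ~
not_thirty = data[~(data == 30)]

print(middle)
print(edges)
print(not_thirty)
```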


<font size=6>Modifying Elements

<font size=4>You can modify elements in a Series by indexing or slicing.

In [84]:
# Create a Series with custom index
data = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
print(data)

a    10
b    20
c    30
d    40
e    50
dtype: int64


In [86]:
data['a'] = 15  # Change value at index 'a'
print(data) 

print()

data[1:3] = [25, 35]  # Modify a slice
print(data)  


a    15
b    20
c    30
d    40
e    50
dtype: int64

a    15
b    25
c    35
d    40
e    50
dtype: int64
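
<font size=4>Elements can also be modified by condition or by a list of labels. The sketch below uses `.loc` for both; the replacement values are arbitrary and chosen only for illustration.

```python
import pandas as pd

data = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

# Set every value greater than 30 to 0
data.loc[data > 30] = 0

# Update several labels at once
data.loc[['a', 'b']] = [11, 22]

print(data)
```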


<font size=6>Looping in Series

<font size=4>Looping over a Series means iterating over its elements one at a time. Vectorized operations are generally more efficient, but sometimes a loop is needed for a specific task. Here is an example that uses a for loop over a Pandas Series:

In [2]:
# Create a Pandas Series
data = pd.Series([1, 2, 3, 4, 5])
# Create an empty list to store the results
squared_values = []

# Loop through each value in the Series
for value in data:
    squared_values.append(value ** 2)  # Square the value and append to the list

# Convert the list back to a Series
squared_series = pd.Series(squared_values)

print(squared_series)


0     1
1     4
2     9
3    16
4    25
dtype: int64


<font size=4>In this example, we will use a while loop to increment each value by 1.

In [3]:
# Create a Pandas Series
data = pd.Series([1, 2, 3, 4, 5])
# Create an empty list to store the results
incremented_values = []

# Initialize a counter
i = 0

# Loop through the Series using a while loop
while i < len(data):
    incremented_values.append(data[i] + 1)  # Increment the value and append to the list
    i += 1  # Increment the counter

# Convert the list back to a Series
incremented_series = pd.Series(incremented_values)

print(incremented_series)


0    2
1    3
2    4
3    5
4    6
dtype: int64
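
<font size=4>For comparison, the same two results can be produced without an explicit loop. This sketch shows the vectorized form, plus `items()` for the cases where a loop over labels and values is really needed:

```python
import pandas as pd

data = pd.Series([1, 2, 3, 4, 5])

# Vectorized: the operation is applied to every element at once
squared_series = data ** 2
incremented_series = data + 1

# items() yields (label, value) pairs when a loop is really needed
pairs = [(label, value) for label, value in data.items()]

print(squared_series)
print(incremented_series)
print(pairs)
```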


<font size=6>Some Important Series Methods

<font size=5>***astype()***

<font size=4>The astype() method is used to cast a Pandas object to a specified data type.

In [4]:
# Create a Series
s = pd.Series(['1', '2', '3'])

# Convert to integer type
s = s.astype(int)
print(s)


0    1
1    2
2    3
dtype: int32
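
<font size=4>The `int32` in the output above depends on the platform; other systems may report `int64` for the same code. Note also that `astype(int)` fails if any string cannot be parsed. One alternative for messy input is `pd.to_numeric` with `errors='coerce'`, sketched below with a made-up Series `raw`:

```python
import pandas as pd

raw = pd.Series(['1', '2', 'three'])

# astype(int) would raise ValueError here because of 'three';
# errors='coerce' turns unparsable entries into NaN instead
numbers = pd.to_numeric(raw, errors='coerce')

print(numbers)
```

<font size=4>Because NaN is a floating-point value, the result has a float dtype rather than an integer one.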


<font size=5>***between()***

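<font size=4>Returns a Boolean Series showing whether each value falls within a specified range (both endpoints are included by default). The result can be used as a mask to filter the Series.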

In [5]:
# Create a Series
s = pd.Series([1, 2, 3, 4, 5])

# Filter values between 2 and 4
filtered = s[s.between(2, 4)]
print(filtered)


1    2
2    3
3    4
dtype: int64
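
<font size=4>Whether the endpoints count can be controlled with the `inclusive` argument. The string values used here assume a reasonably recent Pandas version (1.3 or later, to my knowledge):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Default: both endpoints are included
inside = s.between(2, 4)

# Exclude both endpoints
strictly_inside = s.between(2, 4, inclusive='neither')

print(s[inside])
print(s[strictly_inside])
```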


<font size=5>***clip()***

<font size=4>Limits the values in the Series to a specified range.

In [6]:
# Create a Series
s = pd.Series([1, 2, 3, 4, 5])

# Clip values to be between 2 and 4
clipped = s.clip(lower=2, upper=4)
print(clipped)


0    2
1    2
2    3
3    4
4    4
dtype: int64


<font size=5>***drop_duplicates()***

<font size=4>Removes duplicate values from the Series.

In [7]:
# Create a Series with duplicates
s = pd.Series([1, 1, 2, 3, 3])

# Drop duplicates
unique = s.drop_duplicates()
print(unique)


0    1
2    2
3    3
dtype: int64


<font size=5>***isnull()***

<font size=4>Detects missing values (NaN) in the Series.

In [8]:
# Create a Series with a missing value
s = pd.Series([1, None, 3])

# Check for null values
null_mask = s.isnull()
print(null_mask)


0    False
1     True
2    False
dtype: bool


<font size=5>***dropna()***

<font size=4>Removes missing values from the Series.

In [9]:
# Create a Series with missing values
s = pd.Series([1, None, 3])

# Drop missing values
cleaned = s.dropna()
print(cleaned)


0    1.0
2    3.0
dtype: float64


<font size=5>***fillna()***

<font size=4>Replaces missing values with a specified value.

In [10]:
# Create a Series with a missing value
s = pd.Series([1, None, 3])

# Fill missing values with 0
filled = s.fillna(0)
print(filled)


0    1.0
1    0.0
2    3.0
dtype: float64


<font size=5>***isin()***

<font size=4>Checks if elements in the Series are contained in a specified list.

In [11]:
# Create a Series
s = pd.Series([1, 2, 3, 4])

# Check if values are in the list [1, 3]
mask = s.isin([1, 3])
print(s[mask])


0    1
2    3
dtype: int64


<font size=5>***apply()***

<font size=4>Applies a function to each element in the Series.

In [13]:
# Create a Series
s = pd.Series([1, 2, 3])

# Apply a function to square each value
squared = s.apply(lambda x: x ** 2)
print(squared)


0    1
1    4
2    9
dtype: int64


<font size=5>***copy()***

<font size=4>Creates an independent copy of the Series, so later changes to the original do not affect the copy.

In [14]:
# Create a Series
s = pd.Series([1, 2, 3])

# Create a copy of the Series
s_copy = s.copy()

# Modify the original Series
s[0] = 10

print("Original Series:")
print(s)
print("\nCopied Series:")
print(s_copy)


Original Series:
0    10
1     2
2     3
dtype: int64

Copied Series:
0    1
1    2
2    3
dtype: int64
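
<font size=4>The difference from plain assignment is worth seeing side by side. In this sketch (variable names are illustrative), `alias` is just another name for the same Series object, while `independent` is a real copy:

```python
import pandas as pd

s = pd.Series([1, 2, 3])

alias = s                # same object, no data is copied
independent = s.copy()   # separate object with its own data

s.iloc[0] = 10

print(alias is s)           # the two names refer to one object
print(alias.iloc[0])        # the change is visible through the alias
print(independent.iloc[0])  # the copy keeps the old value
```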
