# **Introduction to Pandas**

## __Agenda__

- Fundamentals of Pandas
  * Purpose of Pandas
  * Features of Pandas
- Data Structures
- Introduction to Series
  * Creating and Accessing Pandas Series Using Different Methods
  * Basic Information in Pandas Series
  * Operations and Transformations in Pandas Series
  * Querying a Series

## __1. Fundamentals of Pandas__

https://pandas.pydata.org/

Pandas is an open-source Python library, built on top of NumPy, for data manipulation and analysis.

- It introduces data structures like DataFrame and Series that make working with structured data more efficient.

![link text](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/ADSP_Images/Lesson_04_Working_with_Pandas/1_Introduction_to_Pandas/pandas.png)

### __1.1 Purpose of Pandas__
![Purpose of Pandas](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/ADSP_Images/Lesson_04_Working_with_Pandas/1_Introduction_to_Pandas/Purpose_of_Pandas.png)

### __1.2 Features of Pandas__
![Features of Pandas](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/ADSP_Images/Updated_Images/Lesson_4/4_01/Features_of_Pandas.png)




## __2. Data Structures__
The two main data structures in Pandas are Series and DataFrame:
![link text](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/ADSP_Images/Lesson_04_Working_with_Pandas/1_Introduction_to_Pandas/Data_Structures.png)

## __3. Introduction to Series__
A Series is a one-dimensional, array-like object that holds data together with an associated set of labels, called its index.

It can be created with different data inputs:
![link text](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/ADSP_Images/Updated_Images/Lesson_4/4_01/Introduction_to_Series.png)

To install Pandas, run `pip install pandas`.
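As a rough sketch of what "different data inputs" can look like in practice, the snippet below builds a Series from a list, a NumPy array, a dictionary, and a single scalar. These are common inputs that `pd.Series` accepts and may not match the image exactly. The variable names are illustrative and not part of the lesson.

```python
import numpy as np
import pandas as pd

# From a Python list: a default integer index (0, 1, 2, ...) is assigned
from_list = pd.Series([10, 20, 30])

# From a NumPy array, with explicit index labels
from_array = pd.Series(np.array([1.5, 2.5]), index=['x', 'y'])

# From a dictionary: the keys become the index labels
from_dict = pd.Series({'a': 1, 'b': 2})

# From a scalar: the value is repeated once per index label
from_scalar = pd.Series(7, index=['p', 'q', 'r'])

print(from_list, from_array, from_dict, from_scalar, sep='\n\n')
```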

### __3.1 Creating and Accessing Pandas Series Using Different Methods:__

In [3]:
import pandas as pd

# Creating a Pandas Series from a list
data = [1, 2, 3.3,  5]
series = pd.Series(data, index=['a','1b',3.7,100])
print(series)

a      1.0
1b     2.0
3.7    3.3
100    5.0
dtype: float64


In [5]:
data = [10, 20, 30,40,50]
series = pd.Series(data)
print(series)

0    10
1    20
2    30
3    40
4    50
dtype: int64


In [13]:
# Creating a Pandas Series with a specified index
index = ['a', 'b', 'c', 'd', 'e']
series_with_index = pd.Series(data, index=index)
print(series_with_index)
# Accessing data in a Series
print(series_with_index.iloc[2])  # Element at integer position 2 (the third element)
print(series_with_index.loc['b'])  # Element with index label 'b'
print(series_with_index.iloc[-2])  # Element at position -2 (second from the end)


a    10
b    20
c    30
d    40
e    50
dtype: int64
30
20
40


Indexing: 

- `iloc`: integer-position-based indexing; a slice excludes its end position
- `loc`: label-based indexing; a slice includes its end label
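The difference between the two is easiest to see when the index labels are themselves integers that do not match the positions. The following is a minimal sketch with made-up values:

```python
import pandas as pd

# Integer labels deliberately out of order, so label != position
s = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = s.loc[0]      # the element whose label is 0
by_position = s.iloc[0]  # the first element, regardless of its label

print(by_label)     # 20
print(by_position)  # 10
```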

In [15]:
print(series_with_index)
print(series_with_index.iloc[2:4]) 
print(series_with_index.loc['c':'e']) 

a    10
b    20
c    30
d    40
e    50
dtype: int64
c    30
d    40
dtype: int64
c    30
d    40
e    50
dtype: int64


In [41]:
# Creating a Pandas Series from a dictionary
data_dict = {'cat': 1, 'b': 2, 'c': 3, 111: 4, 'e': 'bus'}
series_from_dict = pd.Series(data_dict)
print(series_from_dict)

cat      1
b        2
c        3
111      4
e      bus
dtype: object
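When a dictionary is combined with an explicit `index` argument, Pandas looks up each requested label in the dictionary instead of using all keys. The sketch below uses hypothetical data to show that labels missing from the dictionary are filled with `NaN`:

```python
import pandas as pd

prices = {'apple': 3, 'banana': 1, 'cherry': 7}  # hypothetical data

# Only the requested labels are kept, in the requested order.
# 'durian' is not a key in the dictionary, so its value is NaN.
subset = pd.Series(prices, index=['cherry', 'apple', 'durian'])
print(subset)
```

Because `NaN` is a floating-point value, the resulting Series has dtype `float64` even though the dictionary values are integers.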


### __3.2 Basic Information in Pandas Series__
Methods and attributes such as `head()`, `tail()`, `shape`, `describe()`, `unique()`, and `nunique()` summarize the characteristics of a Series, which makes data exploration and analysis easier.

In [19]:
print(series)
# Return the first n rows
print("head")
first_n_rows = series.head(3)
print(first_n_rows)

0    10
1    20
2    30
3    40
4    50
dtype: int64
head
0    10
1    20
2    30
dtype: int64


In [27]:
print(series)
# Return the last n rows
print("tail")
last_n_rows = series.tail(2)
print(last_n_rows)

0    1
1    2
2    3
3    4
4    5
dtype: int64
tail
3    4
4    5
dtype: int64


In [21]:
# Return the shape; for a Series this is a one-element tuple: (number of rows,)
dimensions = series.shape
print(dimensions)

(5,)
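Alongside `shape`, a few other attributes describe a Series at a glance. This is a small sketch on a throwaway Series (the name `s` is illustrative):

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])

print(s.shape)  # one-element tuple: (number of rows,)
print(s.size)   # total number of elements
print(s.ndim)   # number of dimensions; always 1 for a Series
print(s.dtype)  # data type of the values
```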


In [45]:
# Generate descriptive statistics
stats = series.describe()
print(stats)

count    4.000000
mean     2.750000
std      1.707825
min      1.000000
25%      1.750000
50%      2.500000
75%      3.500000
max      5.000000
dtype: float64


In [41]:
# Creating a Pandas Series from a list
data = [1, 2, 3,3, 5]
series = pd.Series(data, index=['a','1b',3.7,100,1])
print(series)

# Return unique values
unique_values = series.unique()
print(unique_values)


a      1
1b     2
3.7    3
100    3
1      5
dtype: int64
[1 2 3 5]


In [33]:
# Return the number of unique values
num_unique_values = series.nunique()
print(num_unique_values)

4
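If the frequency of each value matters and not just the distinct values, `value_counts()` is a natural companion to `unique()` and `nunique()`. A minimal sketch on the same values as above, with a default index for brevity:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 3, 5])

# Count how often each distinct value occurs, most frequent first
counts = s.value_counts()
print(counts)
```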


### __3.3 Operations and Transformations in Pandas Series__
Operations and transformations on a Pandas Series are used to modify, enrich, and clean data.

This section covers element-wise arithmetic, detecting and filling missing values (`isnull()`, `isna()`, `fillna()`), applying functions and mappings (`apply()`, `map()`), and sorting (`sort_values()`).

In [25]:
print(series)
print(series_with_index)
# Element-wise addition
result_series = series + series_with_index
print(result_series)



0    10
1    20
2    30
3    40
4    50
dtype: int64
a    10
b    20
c    30
d    40
e    50
dtype: int64
0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
a   NaN
b   NaN
c   NaN
d   NaN
e   NaN
dtype: float64
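The result above is all `NaN` because arithmetic between two Series aligns on index labels, not on position, and the two indexes (`0` to `4` and `'a'` to `'e'`) share no labels. The sketch below, using small made-up Series, shows partial overlap and two possible ways to handle it:

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Labels present in only one Series produce NaN
aligned_sum = left + right

# Treat a missing label as 0 instead of producing NaN
filled_sum = left.add(right, fill_value=0)

# Ignore labels entirely and add by position on the underlying arrays
positional_sum = left.to_numpy() + right.to_numpy()

print(aligned_sum)
print(filled_sum)
print(positional_sum)
```

Which option is appropriate depends on whether the labels carry meaning; adding by position discards them.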


In [70]:
# Check for missing values
missing_values = result_series.isnull()
print(missing_values)



1       True
3.7     True
100     True
1b      True
a      False
b       True
c       True
d       True
e       True
dtype: bool


In [39]:
# Check for missing values (isna() is an alias of isnull())
missing_values = result_series.isna()
print(missing_values)

1       True
3.7     True
100     True
1b      True
a      False
b       True
c       True
d       True
e       True
dtype: bool


In [27]:
# Fill missing values with a specified value
filled_series = result_series.fillna(10)
print(filled_series)

0    10.0
1    10.0
2    10.0
3    10.0
4    10.0
a    10.0
b    10.0
c    10.0
d    10.0
e    10.0
dtype: float64


In [43]:
print(series)

# Apply a function to each element
squared_series = series.apply(lambda x: x**2)
print(squared_series)

# Map values using a dictionary
mapped_series = series.map({1: 'one', 2: 'two', 3: 'three'})
print(mapped_series)

# Sort the Series by values
sorted_series = series.sort_values()
print(sorted_series)

# Check for missing values
missing_values = series.isnull()
print(missing_values)

# Fill missing values with a specified value
filled_series = series.fillna(0)
print(filled_series)

a      1
1b     2
3.7    3
100    3
1      5
dtype: int64
a       1
1b      4
3.7     9
100     9
1      25
dtype: int64
a        one
1b       two
3.7    three
100    three
1        NaN
dtype: object
a      1
1b     2
3.7    3
100    3
1      5
dtype: int64
a      False
1b     False
3.7    False
100    False
1      False
dtype: bool
a      1
1b     2
3.7    3
100    3
1      5
dtype: int64
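In the output above, `map()` returns `NaN` for the value `5` because the dictionary has no entry for it. One possible way to handle unmapped values, sketched here with an arbitrary placeholder string, is to chain `fillna()`:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 5])
names = {1: 'one', 2: 'two', 3: 'three'}  # no entry for 5

mapped = s.map(names)              # 5 has no match, so it becomes NaN
labelled = mapped.fillna('other')  # 'other' is an arbitrary placeholder
print(labelled)
```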


### __3.4 Querying a Series__
Selecting and filtering data based on specific conditions is an essential aspect of querying a Pandas Series.

The following examples illustrate common querying operations that can be applied to a Pandas Series:

In [74]:
import pandas as pd

# Create a Pandas Series
data = {'a': 10, 'b': 20, 'c': 30, 'd': 40, 'e': 50}
series = pd.Series(data)
print(series)                  

a    10
b    20
c    30
d    40
e    50
dtype: int64


In [76]:
# Select elements greater than 30
selected_greater_than_30 = series[series > 30]
print(selected_greater_than_30)

d    40
e    50
dtype: int64


In [80]:
# Select elements equal to 20
selected_equal_to_20 = series[series == 20]

# Select elements not equal to 40
selected_not_equal_to_40 = series[series != 40]

# Select elements based on multiple conditions
selected_multiple_conditions = series[(series > 20) & (series < 50)]

# Select elements based on a list of values
selected_by_list = series[series.isin([20, 40, 60])]



# Query based on index labels
selected_by_index_labels = series.loc[['a', 'c', 'e']]

# Query based on numeric position
selected_by_numeric_position = series.iloc[1:4]

# Display the results
print("Original Series:")
print(series)
print("\nSelected greater than 30:")
print(selected_greater_than_30)
print("\nSelected equal To 20:")
print(selected_equal_to_20)
print("\nSelected not equal to 40:")
print(selected_not_equal_to_40)
print("\nSelected based on multiple conditions:")
print(selected_multiple_conditions)
print("\nSelected based on list of values:")
print(selected_by_list)

print("\nSelected based on index labels:")
print(selected_by_index_labels)
print("\nSelected based on numeric position:")
print(selected_by_numeric_position)

Original Series:
a    10
b    20
c    30
d    40
e    50
dtype: int64

Selected greater than 30:
d    40
e    50
dtype: int64

Selected equal To 20:
b    20
dtype: int64

Selected not equal to 40:
a    10
b    20
c    30
e    50
dtype: int64

Selected based on multiple conditions:
c    30
d    40
dtype: int64

Selected based on list of values:
b    20
d    40
dtype: int64

Selected based on index labels:
a    10
c    30
e    50
dtype: int64

Selected based on numeric position:
b    20
c    30
d    40
dtype: int64
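The conditions above are combined with `&` (and). As a complementary sketch on the same data, boolean masks can also be combined with `|` (or) and negated with `~`, and `between()` offers a shorthand for an inclusive range check:

```python
import pandas as pd

series = pd.Series({'a': 10, 'b': 20, 'c': 30, 'd': 40, 'e': 50})

# Either condition may hold: values below 20 or above 40
outside_range = series[(series < 20) | (series > 40)]

# Negate a mask: values that are NOT in the list
not_in_list = series[~series.isin([20, 40])]

# Inclusive range check: 20 <= value <= 40
in_range = series[series.between(20, 40)]

print(outside_range)
print(not_in_list)
print(in_range)
```

Each condition needs its own parentheses because `&`, `|`, and `~` bind more tightly than comparison operators in Python.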


In [29]:
# Select elements using string methods (if applicable)
string_series = pd.Series(['apple', 'banana', 'berry', 'date', 'elderberry'],index=range(1,6))
print(string_series)
selected_by_string_method = string_series[string_series.str.startswith('b')]
print("\nSelected based on string method (startswith):")
print(selected_by_string_method)

1         apple
2        banana
3         berry
4          date
5    elderberry
dtype: object

Selected based on string method (startswith):
2    banana
3     berry
dtype: object


# __Assisted Practice__

## __Problem Statement:__
Use Pandas Series to analyze sales data for a retail store over a week and draw insights from the data.

### __Dataset:__
__Sample sales data__

sales_data = [120, 150, 130, 170, 160, 180, 140]

days_of_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

## __Steps to Perform__

1. Create a Pandas Series for sales data
   - Use a list of daily sales figures to create a Pandas Series
   - Assign days of the week as the index
2. Access and manipulate sales data
   - Access sales data for specific days using index labels
   - Calculate total sales for the week
   - Identify the day with the highest and lowest sales
3. Basic analysis of sales data
   - Calculate the average sales for the week
   - Determine the days with sales figures significantly different from the average

In [8]:
import pandas as pd
data = [120, 150, 130, 170, 160, 180, 140]
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
sales = pd.Series(data, index=days)
print(sales)

Monday       120
Tuesday      150
Wednesday    130
Thursday     170
Friday       160
Saturday     180
Sunday       140
dtype: int64


In [26]:
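print(sales[['Monday', 'Tuesday', 'Wednesday']])  # Several days by label
print("Sales on Friday:", sales.loc['Friday'])    # One day by label
print(sales.iloc[3])  # Fourth day (Thursday) by position
print(sales.iloc[6])  # Seventh day (Sunday) by position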

Monday       120
Tuesday      150
Wednesday    130
dtype: int64
Sales on Friday: 160
170
140


In [28]:
sales.shape

(7,)

In [30]:
sales.ndim

1

In [32]:
Total_sales = sales.sum()
print(Total_sales)

1050


In [34]:
sales.describe()

count      7.000000
mean     150.000000
std       21.602469
min      120.000000
25%      135.000000
50%      150.000000
75%      165.000000
max      180.000000
dtype: float64

In [46]:
print(sales.idxmax())  # Index label of the highest sales value
print(sales.idxmin())  # Index label of the lowest sales value



Saturday
Monday


In [49]:
avg_sales = sales.mean()
print(avg_sales)


150.0


In [55]:
import numpy as np
print ('Average Sales = ',np.mean(data))

Average Sales =  150.0


In [57]:
# Days whose sales are not exactly equal to the weekly average
Different = sales[(sales != avg_sales)]
print(Different)

Monday       120
Wednesday    130
Thursday     170
Friday       160
Saturday     180
Sunday       140
dtype: int64
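The filter above keeps every day that is not exactly equal to the average, which is a loose reading of "significantly different". One possible stricter definition, assumed here for illustration only, is a deviation of more than one standard deviation from the mean:

```python
import pandas as pd

sales = pd.Series(
    [120, 150, 130, 170, 160, 180, 140],
    index=['Monday', 'Tuesday', 'Wednesday', 'Thursday',
           'Friday', 'Saturday', 'Sunday'],
)

avg_sales = sales.mean()
# Assumption: "significant" means more than one sample standard deviation
threshold = sales.std()

# Absolute distance of each day from the weekly average
deviation = (sales - avg_sales).abs()
significant = sales[deviation > threshold]
print(significant)
```

With this threshold only Monday and Saturday qualify; a different cutoff would give a different answer.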


In [63]:
import pandas as pd
data = [120, 150, 130, 170, 160, 180, 140]
sales1 = pd.Series(data)
print (sales1)

0    120
1    150
2    130
3    170
4    160
5    180
6    140
dtype: int64


In [65]:
data2 = [12, 10, 10, 17, 16, 18, 14]
sales2 = pd.Series(data2)
print (sales2)

0    12
1    10
2    10
3    17
4    16
5    18
6    14
dtype: int64


In [67]:
sales1

0    120
1    150
2    130
3    170
4    160
5    180
6    140
dtype: int64

In [69]:
sales1+sales2

0    132
1    160
2    140
3    187
4    176
5    198
6    154
dtype: int64

In [77]:
sales.index

Index(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday',
       'Sunday'],
      dtype='object')

In [79]:
sales.index[2:5]

Index(['Wednesday', 'Thursday', 'Friday'], dtype='object')