# **`Data Science Learners Hub`**

**Module : Python**

**email** : [datasciencelearnershub@gmail.com](mailto:datasciencelearnershub@gmail.com)

## **`#1: Introduction to Pandas`**
1. **Overview of Pandas**
   - What is Pandas?
   - Why use Pandas for data analysis?

2. **Installation and Setup**
   - Installing Pandas
   - Importing Pandas in a Python environment

3. **Pandas Data Structures**
   - Series
   - DataFrame

### **`3. Pandas Data Structures`**


#### `Pandas Series`

**Definition:**
A Pandas Series is a one-dimensional labeled array capable of holding any data type (integers, floats, strings, Python objects, and so on). It is a fundamental data structure in Pandas, and its labeled index is used to access and manipulate the data. A Series has a single dtype, as a NumPy array does; values of mixed types are stored under the generic `object` dtype.
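
As a small sketch of the "any data type" point (the variable names are illustrative and not part of the original notebook), the following compares the dtype of a Series of integers with that of a Series of mixed values:

```python
import pandas as pd

# A Series of integers gets a compact numeric dtype.
int_series = pd.Series([1, 2, 3])

# A Series of mixed Python values falls back to the generic object dtype.
mixed_series = pd.Series([1, "two", 3.0])

print(int_series.dtype)    # an integer dtype, typically int64
print(mixed_series.dtype)  # object
```

Numeric dtypes are what make vectorized operations fast, so mixed-type Series are usually best kept for labels or free-form values.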

**Differences from Other Data Structures:**

1. **NumPy Array:**
   - A Pandas Series is conceptually similar to a NumPy array but is augmented with labels, giving it more flexibility.
   - While NumPy arrays have an implicitly defined integer index, Pandas Series has an explicitly defined index associated with each element.

2. **Python List:**
   - In contrast to Python lists, a Pandas Series can have a custom index, enabling more expressive and meaningful data representation.
   - Series operations are vectorized, so arithmetic and comparisons apply to every element without an explicit Python loop, and a Series provides additional methods for data analysis.
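
The difference between an implicit and an explicit index can be sketched as follows. The names `arr` and `ser` and the labels `x`, `y`, `z` are assumptions chosen only for illustration:

```python
import numpy as np
import pandas as pd

values = [10, 20, 30]

arr = np.array(values)                           # implicit integer positions only
ser = pd.Series(values, index=["x", "y", "z"])   # explicit labels

print(arr[1])              # access by position
print(ser["y"])            # access by label
print(ser.iloc[1])         # access by position on a Series
print(ser.index.tolist())  # the labels themselves
print(ser.to_numpy())      # the underlying values as a NumPy array
```

All three element lookups return the same value, 20; only the way the element is addressed differs.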


**Creation of Series:**


1. **From a List:**
   - You can create a Series from a Python list using the `pd.Series()` constructor:

In [1]:
import pandas as pd

data_list = [1, 2, 3, 4, 5]
series_from_list = pd.Series(data_list)
print(series_from_list)

0    1
1    2
2    3
3    4
4    5
dtype: int64


2. **Specifying Custom Index:**
   - You can specify a custom index for the Series:

In [2]:
custom_index = ['a', 'b', 'c', 'd', 'e']
series_with_custom_index = pd.Series(data_list, index=custom_index)
print(series_with_custom_index)

a    1
b    2
c    3
d    4
e    5
dtype: int64


3. **From a Dictionary:**
   - You can create a Series from a Python dictionary, where keys become the index and values become the data:

In [4]:
data_dict = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
series_from_dict = pd.Series(data_dict)
print(series_from_dict)

a    1
b    2
c    3
d    4
e    5
dtype: int64


**Basic Operations on Series:**

1. **Indexing:**
   - Accessing elements of a Series can be done using the index:

In [6]:
value_at_b = series_with_custom_index['b']
print(value_at_b)

# The same element can be read by position with .iloc (positions start at 0)
print(series_with_custom_index.iloc[1])

2
2


2. **Slicing:**
   - Slicing allows selecting a subset of a Series based on the index:

In [8]:
subset = series_with_custom_index['a':'c']

print(subset)

# The same rows can be selected by position; unlike a label slice, a positional slice excludes the end position
print(series_with_custom_index.iloc[0:3])

a    1
b    2
c    3
dtype: int64
a    1
b    2
c    3
dtype: int64
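
A point that is easy to miss is that label-based slices include the end label, while position-based slices exclude the end position. A minimal sketch, using the explicit `.loc` and `.iloc` accessors and an illustrative Series `s`:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

# Label-based slice: both endpoints are included.
by_label = s.loc["a":"c"]

# Position-based slice: the end position is excluded, as with Python lists.
by_position = s.iloc[0:2]

print(by_label.tolist())     # values at labels a, b, c
print(by_position.tolist())  # values at positions 0 and 1
```

Using `.loc` for labels and `.iloc` for positions keeps the intent unambiguous, which matters most when the index itself contains integers.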


3. **Mathematical Operations:**
   - Series supports element-wise mathematical operations:

In [9]:
multiplied_series = series_from_list * 2

print(multiplied_series)

# Note: every element of the Series is multiplied by 2, just as with a NumPy array

0     2
1     4
2     6
3     8
4    10
dtype: int64


4. **Conditional Indexing:**
   - You can use boolean indexing to filter elements based on a condition:

In [10]:
greater_than_three = series_from_list[series_from_list > 3]
print(greater_than_three)

3    4
4    5
dtype: int64
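
The condition inside the brackets is itself a boolean Series, and several conditions can be combined with `&` (and) and `|` (or). A short sketch with an illustrative Series `s`; note that each condition needs its own parentheses:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# The comparison produces a boolean Series (a mask) of the same length.
mask = s > 3

# Combine conditions element-wise; parentheses are required around each one.
between = s[(s > 1) & (s < 5)]
outside = s[(s < 2) | (s > 4)]

print(mask.tolist())
print(between.tolist())
print(outside.tolist())
```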


#### **`Operations on Pandas Series`**

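Pandas Series support a wide range of arithmetic operations and mathematical functions. They are applied element-wise, and operations between two Series match elements by index label rather than by position. The snippets below assume that `series`, `series1`, and `series2` are existing Series, that `scalar_value` is a number, and that NumPy has been imported as `np`. Here are some of the key operations: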

**1. Arithmetic Operations:**

   - **Addition:**
     ```python
     series_sum = series1 + series2
     ```

   - **Subtraction:**
     ```python
     series_diff = series1 - series2
     ```

   - **Multiplication:**
     ```python
     series_product = series1 * series2
     ```

   - **Division:**
     ```python
     series_ratio = series1 / series2
     ```

   - **Scalar Operations:**
     ```python
     series_scaled = series * scalar_value
     ```
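
The arithmetic snippets above can be tried with a runnable sketch such as the following. The sample values and labels are assumptions made for illustration; the two Series deliberately share only some labels to show how pandas aligns them:

```python
import pandas as pd

series1 = pd.Series([10, 20, 30], index=["a", "b", "c"])
series2 = pd.Series([1, 2, 3], index=["b", "c", "d"])

# Elements are matched by label; labels present in only one Series give NaN.
series_sum = series1 + series2

# The method form accepts fill_value to treat a missing side as 0 instead.
series_sum_filled = series1.add(series2, fill_value=0)

print(series_sum)
print(series_sum_filled)
```

Here `b` and `c` are summed normally, while `a` and `d` are `NaN` in `series_sum` and keep their single available value in `series_sum_filled`.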

**2. Mathematical Operations:**

   - **Square Root:**
     ```python
     series_sqrt = np.sqrt(series)
     ```

   - **Exponential:**
     ```python
     series_exp = np.exp(series)
     ```

   - **Logarithm:**
     ```python
     series_log = np.log(series)
     ```

   - **Trigonometric Functions:**
     ```python
     series_sin = np.sin(series)
     ```
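
A runnable sketch of the NumPy functions above, using an illustrative Series of perfect squares. NumPy functions applied to a Series return a new Series that keeps the original index:

```python
import numpy as np
import pandas as pd

series = pd.Series([1.0, 4.0, 9.0], index=["a", "b", "c"])

series_sqrt = np.sqrt(series)  # element-wise square root
series_log = np.log(series)    # element-wise natural logarithm

print(series_sqrt)
print(type(series_sqrt))       # still a pandas Series, index preserved
print(series_log["a"])         # log(1.0)
```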



**Conclusion:**
Pandas Series provides a versatile and efficient way to work with one-dimensional labeled data. Its ability to handle different data types, custom indexing, and support for various operations make it a fundamental building block for more complex data manipulations in Pandas.


#### `Pandas DataFrame`

**Introduction:**
A Pandas DataFrame is a two-dimensional, tabular data structure with labeled axes (rows and columns). It is a powerful and versatile tool for handling structured data in Python. The DataFrame is one of the core components of Pandas and is widely used in data analysis and manipulation tasks.

**Significance in Data Analysis:**

1. **Tabular Structure:**
   - DataFrames provide a tabular structure, similar to a spreadsheet, making them well-suited for representing and analyzing structured data.

2. **Labeled Axes:**
   - Both rows and columns in a DataFrame have labeled indices, allowing for easy and intuitive access to data. This makes data manipulation more straightforward and expressive.

3. **Support for Heterogeneous Data Types:**
   - DataFrames can accommodate columns with different data types, including numerical, categorical, and textual data. This flexibility is crucial for handling diverse datasets.

4. **Data Alignment:**
   - Operations between DataFrames automatically align data based on their indices and columns, simplifying complex data manipulations.
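
Two of the points above, per-column data types and automatic alignment, can be sketched as follows. The `Score` column and the `left`/`right` frames are hypothetical examples, not data from the original notebook:

```python
import pandas as pd

# Each column keeps its own dtype: text, integers, and floats side by side.
df = pd.DataFrame({
    "Name": ["Laxman", "Rajesh"],
    "Age": [25, 30],
    "Score": [88.5, 92.0],
})
print(df.dtypes)

# Arithmetic between DataFrames aligns on both row labels and column names.
left = pd.DataFrame({"x": [1, 2]}, index=["r1", "r2"])
right = pd.DataFrame({"x": [10, 20]}, index=["r2", "r3"])
total = left + right
print(total)  # only r2 exists in both frames; r1 and r3 become NaN
```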



**Creating DataFrames:**

1. **From a Dictionary:**
   - You can create a DataFrame from a dictionary where keys become column names and values become the data:

In [3]:
import pandas as pd

data = {'Name': ['Laxman', 'Rajesh', 'Ramulu'],
        'Age': [25, 30, 35],
        'City': ['New York', 'San Francisco', 'Los Angeles']}

df = pd.DataFrame(data)
print(df)

     Name  Age           City
0  Laxman   25       New York
1  Rajesh   30  San Francisco
2  Ramulu   35    Los Angeles


2. **Specifying Index:**
   - You can specify a custom index for the DataFrame:

In [16]:
custom_index = ['person1', 'person2', 'person3']
df_with_custom_index = pd.DataFrame(data, index=custom_index)

print(df_with_custom_index)

           Name  Age           City
person1  Laxman   25       New York
person2  Rajesh   30  San Francisco
person3  Ramulu   35    Los Angeles
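
Once a custom index is in place, rows can be looked up by label with `.loc` or by position with `.iloc`. A minimal sketch that rebuilds the same DataFrame so it runs on its own:

```python
import pandas as pd

data = {'Name': ['Laxman', 'Rajesh', 'Ramulu'],
        'Age': [25, 30, 35],
        'City': ['New York', 'San Francisco', 'Los Angeles']}
df_with_custom_index = pd.DataFrame(data, index=['person1', 'person2', 'person3'])

# A whole row by label (returned as a Series keyed by column name).
row = df_with_custom_index.loc['person2']

# A single cell by row label and column name.
age = df_with_custom_index.loc['person2', 'Age']

# The first row by position, regardless of its label.
first_row = df_with_custom_index.iloc[0]

print(row)
print(age)
print(first_row['City'])
```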


**Loading Data into DataFrames:**

1. **From CSV File:**
   - Pandas can read data from various file formats. To read from a CSV file (here a local file named `data.csv`; the output depends on that file's contents):

In [27]:
df_from_csv = pd.read_csv('data.csv')
print(df_from_csv)

     Name  Age          City
0  Naresh   25       Chennai
1   Padma   30          Pune
2  Laxman   22      Srinagar
3  Rajesh   31          Pune
4    Bala   51          Pune
5   Ganga   32  Mahabubnagar
6  Kishan   31        Ongole


2. **From Excel File:**
   - Reading data from an Excel file (for `.xlsx` files, pandas relies on an optional engine such as `openpyxl` being installed):

In [18]:
df_from_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1')

print(df_from_excel)

       Name  Age
0  Harshita    8
1     Naina    1
2       Sai    5
3   Aanchal    3
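
The examples above depend on files that exist on disk. To experiment without any file, `pd.read_csv` also accepts a file-like object, so CSV text held in memory can be parsed directly. The CSV text below is made-up sample data:

```python
import io

import pandas as pd

# Sample CSV content kept in a string instead of a file (illustrative data).
csv_text = "Name,Age,City\nLaxman,25,Pune\nRajesh,30,Hyderabad\n"

# io.StringIO wraps the string so that read_csv can treat it like an open file.
df_from_text = pd.read_csv(io.StringIO(csv_text))

print(df_from_text)
print(df_from_text.shape)  # (rows, columns)
```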


**Essential DataFrame Operations:**

1. **Filtering Data:**
   - Filtering rows based on a condition:

In [29]:
filtered_data = df[df['Age'] > 30]

print(filtered_data)

     Name  Age         City
2  Ramulu   35  Los Angeles


2. **Grouping Data:**
   - Grouping data based on a column and performing aggregate operations:

In [30]:
grouped_data = df.groupby('City')['Age'].mean()
print(grouped_data)

City
Los Angeles      35.0
New York         25.0
San Francisco    30.0
Name: Age, dtype: float64
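
In the example above every city appears only once, so each group mean equals the single age in that group. A sketch with a repeated city makes the aggregation visible; the data is invented for illustration, and `agg` is used to compute several statistics at once:

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Pune", "Pune", "Chennai"],
    "Age": [30, 50, 25],
})

# One row per city, one column per aggregate.
summary = df.groupby("City")["Age"].agg(["mean", "max", "count"])

print(summary)
```

The two Pune rows collapse into one group with a mean of 40, a maximum of 50, and a count of 2.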


3. **Merging DataFrames:**
   - Combining two DataFrames based on a common column. By default `pd.merge` performs an inner join, so only the IDs present in both DataFrames are kept:

In [32]:
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Laxman', 'Harshita', 'Naina']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Age': [30, 35, 40]})
merged_df = pd.merge(df1, df2, on='ID')

print(merged_df)

   ID      Name  Age
0   2  Harshita   30
1   3     Naina   35
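
IDs 1 and 4 are missing from the result because each appears in only one of the two DataFrames. The `how` parameter of `pd.merge` controls this behavior; the following sketch reuses the same two frames to compare the join types:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Laxman', 'Harshita', 'Naina']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Age': [30, 35, 40]})

# Inner join (the default): only IDs found in both frames.
inner = pd.merge(df1, df2, on='ID')

# Left join: every row of df1, with NaN where df2 has no match.
left = pd.merge(df1, df2, on='ID', how='left')

# Outer join: every ID from either frame.
outer = pd.merge(df1, df2, on='ID', how='outer')

print(inner)
print(left)
print(outer)
```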


**Conclusion:**
Pandas DataFrames provide a structured and efficient way to handle and analyze tabular data. Their versatility in creating, loading, and manipulating data makes them a fundamental tool in the data scientist's toolbox.

#### Code to create CSV file with sample data

In [1]:
import pandas as pd

# Sample data
data = {
    'Name': ['Laxman', 'Rajesh', 'Ram'],
    'Age': [25, 30, 22],
    'City': ['Pune', 'Hyderabad', 'Mahabubnagar']
}

# Create a DataFrame
df = pd.DataFrame(data)

# Save the DataFrame to a CSV file
df.to_csv('data.csv', index=False)

# Read the CSV file back into a DataFrame
df_from_csv = pd.read_csv('data.csv')

# Print the DataFrame
print(df_from_csv)


     Name  Age          City
0  Laxman   25          Pune
1  Rajesh   30     Hyderabad
2     Ram   22  Mahabubnagar
