# **`Data Science Learners Hub`**

**Module : Python**

**email** : [datasciencelearnershub@gmail.com](mailto:datasciencelearnershub@gmail.com)

## **`#1: Introduction to Pandas`**
1. **Overview of Pandas**
   - What is Pandas?
   - Why use Pandas for data analysis?

2. **Installation and Setup**
   - Installing Pandas
   - Importing Pandas in a Python environment

3. **Pandas Data Structures**
   - Series
   - DataFrame

### **`3. Pandas Data Structures`**


#### `Pandas Series`

**Definition:**
1. A Pandas Series is a <mark>`one-dimensional labeled array capable of holding any data type`</mark>. 
2. It is a <mark>`fundamental data structure`</mark> in Pandas and <mark>`provides a labeled index to access and manipulate data.`</mark> 
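3. <mark>A `Pandas Series can hold values of different types`</mark>; Pandas then stores them with the generic `object` dtype. By contrast, `np.array()` converts mixed input to one common type by default.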

**Differences from Other Data Structures:**

1. **NumPy Array:**
   - A Pandas Series is <mark>`conceptually similar to a NumPy array but is augmented with labels`</mark>, giving it more flexibility.
   - While <mark>`NumPy arrays have an implicitly defined integer index, Pandas Series has an explicitly defined index associated with each element`</mark>.

2. **Python List:**
   - In contrast to Python lists, a <mark>Pandas Series can have a custom index</mark>, enabling more expressive and meaningful data representation.
   - <mark>Series allows for more efficient data manipulation and provides additional functionalities for data analysis.</mark>
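
A small sketch of these differences, assuming made-up subject names and marks (the variable names are illustrative and do not come from the notebook):

```python
import numpy as np
import pandas as pd

marks_list = [87, 98, 79]
marks_array = np.array(marks_list)
marks_series = pd.Series(marks_list, index=['physics', 'maths', 'biology'])

# A list and an array can only be addressed by position
print(marks_list[1])
print(marks_array[1])

# A Series can also be addressed by a meaningful label
print(marks_series['maths'])

# The labels are stored in an explicit index object
print(marks_series.index.tolist())
```

All three lookups return the same mark; only the Series lets the code say which subject it is asking for.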


#### What is a Pandas Series

- In the given code, `pd.Series` refers to the Series class in the pandas library.
- A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).
- It is similar to a column in a DataFrame or a list, but with additional features.

**Creation of Series:**


1. **From a List, Tuple, Dict or NumPy Array:**
    - You can create a Series from a Python **list**, **tuple** or **dict**, or from a **NumPy array**, using the `pd.Series()` **constructor**:

In [1]:
import pandas as pd
import numpy as np

data_list = [1, 2, 3, 4, 5]

series_from_list = pd.Series(data_list)
print('series_from_list')
print(series_from_list)

series_from_array = pd.Series(np.array(data_list))
print('series_from_array')
print(series_from_array)

data_tuple = 1, 2, 3, 4, 5 # Assumed as tuple
series_from_tuple = pd.Series(data_tuple)
print('series_from_tuple')
print(series_from_tuple)
print(type(data_tuple))

data_dict = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
series_from_dict = pd.Series(data_dict)
print(series_from_dict)

# NOTE : No custom index is needed here because the dict keys become the index labels

series_from_list
0    1
1    2
2    3
3    4
4    5
dtype: int64
series_from_array
0    1
1    2
2    3
3    4
4    5
dtype: int64
series_from_tuple
0    1
1    2
2    3
3    4
4    5
dtype: int64
<class 'tuple'>
a    1
b    2
c    3
d    4
e    5
dtype: int64


#### 1. How are the elements of a list, a NumPy array and a Pandas Series displayed in the output?
#### 2. What do you notice in the output?


In [28]:
import pandas as pd
import numpy as np

data_list = [1, 2, 3, 4, 5]
print('data_list')
print(data_list)
array = np.array(data_list) 
print('array')   
print(array)                
series_from_list = pd.Series(data_list)
print('series_from_list')
print(series_from_list)


data_list
[1, 2, 3, 4, 5]
array
[1 2 3 4 5]
series_from_list
0    1
1    2
2    3
3    4
4    5
dtype: int64


**`Note : `** In the output
1. Elements of the list are separated by commas
2. Elements of the array are separated by spaces
3. Elements of the Series are displayed as a column of data, with a default index starting from 0

#### What happens when we multiply a list, NumPy array or Pandas series by a numeric value ?

In [34]:
import pandas as pd
import numpy as np

data_list = [1, 2, 3, 4, 5]
array = np.array(data_list)               
series_from_list = pd.Series(data_list)
# Now multiply each data structure with 2
new_array = array*2
print('new_array : ',new_array)
new_list = data_list*2
print('new_list :',new_list)
new_series=series_from_list*2
print('new_series :\n',new_series)


new_array :  [ 2  4  6  8 10]
new_list : [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
new_series :
 0     2
1     4
2     6
3     8
4    10
dtype: int64


**`Note :`** 
1. For a NumPy array and a Pandas Series, multiplying by 2 multiplies every element by 2
2. For a list, multiplying by 2 repeats the list, so it becomes twice as long
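
To double every element of a plain list, a list comprehension is one common approach. This is a minimal sketch; `doubled_list` is an illustrative name:

```python
data_list = [1, 2, 3, 4, 5]

# The * operator repeats a list, so use a comprehension for element-wise work
doubled_list = [x * 2 for x in data_list]
print(doubled_list)
```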

#### What happens when a list of mostly numeric values contains a single string, and we create a NumPy array and a Pandas Series from it?


In [3]:
import pandas as pd
import numpy as np

data_list = [1, 2, 3, 4, 5, 'h']
print(data_list)
array = np.array(data_list)
print(array)                
series_from_list = pd.Series(data_list)
print(series_from_list)

[1, 2, 3, 4, 5, 'h']
['1' '2' '3' '4' '5' 'h']
0    1
1    2
2    3
3    4
4    5
5    h
dtype: object


**`Note :`**
1. Since 'h' is present in `data_list`, all elements, including the numeric values, are converted to strings when the NumPy array is created. This happens because arrays are homogeneous: every element must have the same type.
2. NumPy picks a common type that can represent every element. Any integer can be written as a string, but 'h' cannot be written as an integer, so the common type is string. A numeric string can later be converted back to a number with `int()`.
3. A NumPy array contains homogeneous data, whereas a list and a Series (with dtype `object`) can hold heterogeneous data
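
The points above can be checked with a short sketch. The use of `pd.to_numeric` to recover the numbers is an assumption added for illustration and is not part of the original notebook:

```python
import numpy as np
import pandas as pd

data_list = [1, 2, 3, 4, 5, 'h']

array = np.array(data_list)
series = pd.Series(data_list)

# The array stores every element as a Unicode string (dtype kind 'U')
print(array.dtype.kind)

# The Series keeps the original Python objects (dtype object)
print(series.dtype)
print([type(value).__name__ for value in series])

# Convert back to numbers; values that cannot be parsed become NaN
numeric = pd.to_numeric(series, errors='coerce')
print(numeric)
```

Because NaN is a floating-point value, the converted Series has a float dtype even though the original numbers were integers.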

2. **Specifying Custom Index:**
   - You can specify a custom index for the Series:

In [2]:
custom_index = ['a', 'b', 'c', 'd', 'e']
series_with_custom_index = pd.Series(data_list, index=custom_index)
print(series_with_custom_index)

a    1
b    2
c    3
d    4
e    5
dtype: int64


3. **From a Dictionary:**
   - You can create a Series from a Python dictionary, where keys become the index and values become the data:

In [4]:
data_dict = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
series_from_dict = pd.Series(data_dict)
print(series_from_dict)

# NOTE : No custom index is needed here because the dict keys become the index labels

a    1
b    2
c    3
d    4
e    5
dtype: int64


**Basic Operations on Series:**

1. **Indexing:**
   - Accessing elements of a Series can be done using the index:

In [6]:
custom_index = ['a', 'b', 'c', 'd', 'e']
series_with_custom_index = pd.Series(data_list, index=custom_index)
value_at_b = series_with_custom_index['b']
print(value_at_b)

# The same element can also be accessed by its position (0, 1, 2, ...) using .iloc
print(series_with_custom_index.iloc[1])

2
2


2. **Slicing:**
   - Slicing allows selecting a subset of a Series based on the index:

In [35]:
import pandas as pd

# Define custom index and create a Series
custom_index = ['a', 'b', 'c', 'd', 'e']
data_list = [1, 2, 3, 4, 5]
series_with_custom_index = pd.Series(data_list, index=custom_index)

# Slicing with custom index labels
subset = series_with_custom_index['a':'c']
print("Subset using custom index labels:")
print(subset)

# Slicing with integers: the labels here are letters, so the integers are treated as positions
print("\nSubset using default integer index:")
print(series_with_custom_index[0:3])

Subset using custom index labels:
a    1
b    2
c    3
dtype: int64

Subset using default integer index:
a    1
b    2
c    3
dtype: int64


#### `Note :`

- **Inclusive Slicing**: When you slice a Pandas Series using **<mark>custom index labels</mark>** (`['a':'c']`), the **<mark>slicing is inclusive of both the start and end labels</mark>**.
  
- **Integer Positions**: You can also slice with integers (`[0:3]`). On a Series whose labels are not integers, as here, the integers are treated as **<mark>positions</mark>** (`0, 1, 2, ...`), and **<mark>the slicing is exclusive of the end position</mark>** (the end position is not included), just as with a Python list.
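
Label-based and position-based slicing can be made explicit with the `.loc` and `.iloc` accessors, which avoids any ambiguity about how an integer is interpreted. A minimal sketch with the same data (the variable names are illustrative):

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# .loc slices by label and includes the end label 'c'
by_label = series.loc['a':'c']
print(by_label)

# .iloc slices by position and excludes the end position 3
by_position = series.iloc[0:3]
print(by_position)
```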


3. **Mathematical Operations:**
   - Series supports element-wise mathematical operations:

In [9]:
multiplied_series = series_from_list * 2

print(multiplied_series)

# Note : Here 2 is multiplied to all the elements of the series same as in case of NumPy arrays

0     2
1     4
2     6
3     8
4    10
dtype: int64


4. **Conditional Indexing:**
   - You can use boolean indexing to filter elements based on a condition:

In [36]:
import pandas as pd

# Example Pandas Series
data = pd.Series([10, 20, 30, 40, 50])

# Boolean indexing to filter data
filtered_data = data[data > 30]  # No column name is needed: a Series holds a single column of values

print(filtered_data)

3    40
4    50
dtype: int64


#### How can an expression inside `[]` filter a Pandas Series? Can the same be done with a list or a NumPy array?


In Pandas, filtering data using square brackets (`[]`) on a Series allows you to specify a boolean expression to select rows that satisfy the condition specified in the brackets. This approach is commonly used and is known as **`boolean indexing`**.

#### Explanation:
1. **Boolean Indexing**:
   - `data > 30` creates a **<mark>boolean mask</mark>** where each element in the Series is checked against the condition (`> 30`). It returns a Series of boolean values (`True` for elements > 30 and `False` otherwise).

2. **Applying the Boolean Mask**:
   - We use `data[data > 30]` to filter `data` based on the boolean mask. This expression returns a new Series (`filtered_data`) containing only the elements of `data` where the condition `data > 30` is `True`.

**`Note :`**
- This filtering technique is efficient because Pandas performs vectorized operations: the comparison is evaluated for the whole Series in optimized compiled code instead of in an explicit Python loop.
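
The two steps can be seen separately by storing the mask in a variable first. A minimal sketch (`mask` is an illustrative name):

```python
import pandas as pd

data = pd.Series([10, 20, 30, 40, 50])

# Step 1: the comparison alone produces a boolean Series (the mask)
mask = data > 30
print(mask)

# Step 2: indexing with the mask keeps only the elements where it is True
print(data[mask])

# True counts as 1, so sum() gives the number of matching elements
print(mask.sum())
```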

### Filtering with Python List:
Filtering with boolean expressions directly in square brackets (`[]`) is not directly supported for Python lists. However, you can achieve similar functionality using list comprehension or external libraries like NumPy.

#### Example with Python List:
```python
# Example Python List
my_list = [10, 20, 30, 40, 50]

# Using list comprehension to filter data
filtered_list = [x for x in my_list if x > 30]

print(filtered_list)
```

Output:
```
[40, 50]
```

### Filtering with NumPy Array:
NumPy arrays support boolean indexing similarly to Pandas Series, where you can use boolean expressions to filter elements.

#### Example with NumPy Array:
```python
import numpy as np

# Example NumPy Array
my_array = np.array([10, 20, 30, 40, 50])

# Boolean indexing to filter data
filtered_array = my_array[my_array > 30]

print(filtered_array)
```

Output:
```
[40 50]
```

#### Explanation:
- Here, `my_array > 30` creates a boolean mask for the NumPy array `my_array`. Then, `my_array[my_array > 30]` applies this mask to `my_array`, resulting in `filtered_array` containing elements greater than 30.

In [11]:
import numpy as np

# Example with NumPy array
data_array = np.array([1, 2, 3, 4, 5])

# Filtering the array to find values greater than 3
greater_than_three_array = data_array[data_array > 3]

print(greater_than_three_array)


[4 5]


### **`Operations on Pandas Series`**

Pandas Series are versatile for data manipulation, supporting various arithmetic and mathematical operations. These operations work element-wise, meaning they are applied to each element in the Series independently.

**1. Arithmetic Operations**

In [40]:
import pandas as pd

data1 = [1, 2, 3]
data2 = [4, 5, 6]

series1 = pd.Series(data1, index=['a', 'b', 'c'])
series2 = pd.Series(data2, index=['a', 'b', 'c'])

# Addition
series_sum = series1 + series2
print('series_sum :\n',series_sum)

# Subtraction
series_diff = series1 - series2
print('\nseries_diff : \n',series_diff)

# Multiplication
series_product = series1 * series2
print('\nseries_product : \n',series_product)

# Division
series_ratio = series1 / series2
print('\nseries_ratio : \n',series_ratio)

# Scalar Operation
scalar_value = 2
series_scaled = series1 * scalar_value
print('\nseries_scaled : \n',series_scaled)

series_sum :
 a    5
b    7
c    9
dtype: int64

series_diff : 
 a   -3
b   -3
c   -3
dtype: int64

series_product : 
 a     4
b    10
c    18
dtype: int64

series_ratio : 
 a    0.25
b    0.40
c    0.50
dtype: float64

series_scaled : 
 a    2
b    4
c    6
dtype: int64


**2. Mathematical Operations**

In [41]:
import pandas as pd
import numpy as np

data1 = [1, 2, 3]
data2 = [4, 5, 6]

series1 = pd.Series(data1, index=['a', 'b', 'c'])
series2 = pd.Series(data2, index=['a', 'b', 'c'])

# Square Root (using NumPy)
series_sqrt = np.sqrt(series1)
print(' series_sqrt :\n',series_sqrt)

# Exponential (using NumPy)
series_exp = np.exp(series1)
print('\n series_exp : \n',series_exp)

# Logarithm (using NumPy) : Calculates the natural log of each element
series_log = np.log(series1)  # Make sure elements are positive for log
print('\n series_log : \n',series_log)

# Trigonometric Functions (using NumPy)
series_sin = np.sin(series1)
print('\n series_sin : \n',series_sin)


 series_sqrt :
 a    1.000000
b    1.414214
c    1.732051
dtype: float64

 series_exp : 
 a     2.718282
b     7.389056
c    20.085537
dtype: float64

 series_log : 
 a    0.000000
b    0.693147
c    1.098612
dtype: float64

 series_sin : 
 a    0.841471
b    0.909297
c    0.141120
dtype: float64


**Conclusion:**
Pandas Series provides a versatile and efficient way to work with one-dimensional labeled data. Its ability to handle different data types, custom indexing, and support for various operations make it a fundamental building block for more complex data manipulations in Pandas.


#### `Pandas DataFrame`

**Introduction:**
1. A Pandas DataFrame is a <mark>`two-dimensional`</mark>, <mark>`tabular data structure with labeled axes (rows and columns)`</mark>. 
2. It is a powerful and versatile tool for handling structured data in Python. 
3. The DataFrame is one of the core components of Pandas and is widely used in data analysis and manipulation tasks.

**Significance in Data Analysis:**

1. **Tabular Structure:**
   - DataFrames provide a tabular structure, similar to a spreadsheet, making them well-suited for representing and analyzing structured data.

2. **Labeled Axes:**
   - <mark>`Both rows and columns in a DataFrame have labeled indices`</mark>, allowing for easy and intuitive access to data. This makes data manipulation more straightforward and expressive.

3. **Support for Heterogeneous Data Types:**
   - <mark>`DataFrames can accommodate columns with different data types, including numerical, categorical, and textual data`</mark>. This flexibility is crucial for handling diverse datasets.

4. **Data Alignment:**
   - <mark>`Operations between DataFrames automatically align data based on their indices and columns`</mark>, simplifying complex data manipulations.

   **`Note`** : This is explained in detail with examples later in the notebook
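
A short sketch of the labeled axes and the per-column data types. The `Height` column and its values are hypothetical and added only for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Laxman', 'Rajesh', 'Ramulu'],
    'Age': [25, 30, 35],
    'Height': [1.70, 1.65, 1.80],   # hypothetical column for illustration
})

# Each column has its own data type
print(df.dtypes)

# Row labels and column labels are stored separately
print(df.index.tolist())
print(df.columns.tolist())
```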


**Creating DataFrames:**

1. **From a Dictionary:**
   - You can create a DataFrame from a dictionary where keys become column names and values become the data:

In [3]:
import pandas as pd

data = {'Name': ['Laxman', 'Rajesh', 'Ramulu'],
        'Age': [25, 30, 35],
        'City': ['New York', 'San Francisco', 'Los Angeles']}

df = pd.DataFrame(data)
print(df)

     Name  Age           City
0  Laxman   25       New York
1  Rajesh   30  San Francisco
2  Ramulu   35    Los Angeles


2. **Specifying Index:**
   - You can specify a custom index for the DataFrame:

In [16]:
custom_index = ['person1', 'person2', 'person3']
df_with_custom_index = pd.DataFrame(data, index=custom_index)

print(df_with_custom_index)

           Name  Age           City
person1  Laxman   25       New York
person2  Rajesh   30  San Francisco
person3  Ramulu   35    Los Angeles


**Loading Data into DataFrames:**

1. **From CSV File:**
   - Pandas can read data from various file formats. To read from a CSV file:

In [27]:
df_from_csv = pd.read_csv('data.csv')
print(df_from_csv)

     Name  Age          City
0  Naresh   25       Chennai
1   Padma   30          Pune
2  Laxman   22      Srinagar
3  Rajesh   31          Pune
4    Bala   51          Pune
5   Ganga   32  Mahabubnagar
6  Kishan   31        Ongole


2. **From Excel File:**
   - Reading data from an Excel file:

In [18]:
df_from_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1')

print(df_from_excel)

       Name  Age
0  Harshita    8
1     Naina    1
2       Sai    5
3   Aanchal    3


**Essential DataFrame Operations:**

1. **Filtering Data:**
   - Filtering rows based on a condition:

In [29]:
filtered_data = df[df['Age'] > 30]

print(filtered_data)

     Name  Age         City
2  Ramulu   35  Los Angeles


2. **Grouping Data:**
   - Grouping data based on a column and performing aggregate operations:

In [30]:
grouped_data = df.groupby('City')['Age'].mean()
print(grouped_data)

City
Los Angeles      35.0
New York         25.0
San Francisco    30.0
Name: Age, dtype: float64


3. **Merging DataFrames:**
   - Combining two DataFrames based on a common column:

In [18]:
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Laxman', 'Harshita', 'Naina']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Age': [30, 35, 40]})
merged_df = pd.merge(df1, df2, on='ID')

print(merged_df)


   ID      Name  Age
0   2  Harshita   30
1   3     Naina   35


**Conclusion:**
Pandas DataFrames provide a structured and efficient way to handle and analyze tabular data. Their versatility in creating, loading, and manipulating data makes them a fundamental tool in the data scientist's toolbox.

#### Code to create CSV file with sample data

In [1]:
import pandas as pd

# Sample data
data = {
    'Name': ['Laxman', 'Rajesh', 'Ram'],
    'Age': [25, 30, 22],
    'City': ['Pune', 'Hyderabad', 'Mahabubnagar']
}

# Create a DataFrame
df = pd.DataFrame(data)

# Save the DataFrame to a CSV file
df.to_csv('data.csv', index=False)

# Read the CSV file back into a DataFrame
df_from_csv = pd.read_csv('data.csv')

# Print the DataFrame
print(df_from_csv)


     Name  Age          City
0  Laxman   25          Pune
1  Rajesh   30     Hyderabad
2     Ram   22  Mahabubnagar


### **`What does the statement below about DataFrames in Pandas mean?`**
- #### `Operations between DataFrames automatically align data based on their indices and columns, simplifying complex data manipulations.`

The statement means that when you perform arithmetic operations (such as `+`, `-`, `*`, `/`) between two Pandas DataFrames, Pandas matches values by their index labels and column names rather than by their position. This alignment ensures that corresponding elements are combined correctly, which simplifies complex data manipulations and avoids errors caused by rows or columns being in a different order. Labels that are present in only one DataFrame produce NaN in the result.

### Explanation with Examples:

Let's illustrate this with some examples:

#### Example 1: Arithmetic Operations

```python
import pandas as pd

# Creating two DataFrames
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [10, 20, 30], 'B': [40, 50, 60]})

# Adding the two DataFrames
df_sum = df1 + df2

print(df_sum)
```

Output:
```
    A   B
0  11  44
1  22  55
2  33  66
```

- In this example, `df1` and `df2` have the same indices and column names. When we add them (`df1 + df2`), Pandas aligns the corresponding elements based on their indices and columns (`A` and `B`). This alignment ensures that the addition operation is performed element-wise correctly.

#### Example 2: Data Alignment

```python
import pandas as pd

# Creating two DataFrames with different indices
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['X', 'Y', 'Z'])
df2 = pd.DataFrame({'A': [10, 20], 'B': [40, 50]}, index=['Y', 'Z'])

# Adding the two DataFrames
df_sum = df1 + df2

print(df_sum)
```

Output:
```
      A     B
X   NaN   NaN
Y  12.0  45.0
Z  23.0  56.0
```

- In this example, `df1` and `df2` have different indices (`df1` has ['X', 'Y', 'Z'] and `df2` has ['Y', 'Z']). When adding them (`df1 + df2`), Pandas aligns the data based on indices. Row 'X' exists only in `df1`, so it has no matching row in `df2` and results in NaN (Not a Number) values in the resulting DataFrame (`df_sum`). Rows 'Y' and 'Z' exist in both and are added normally. Because NaN is a floating-point value, the result columns become floats (`12.0` instead of `12`). This alignment ensures that values are combined only when their labels match, even when indices do not completely overlap.

#### Example 3: Data Alignment with Missing Columns

```python
import pandas as pd

# Creating two DataFrames with different columns
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [10, 20, 30], 'C': [40, 50, 60]})

# Adding the two DataFrames
df_sum = df1 + df2

print(df_sum)
```

Output:
```
    A   B   C
0  11 NaN NaN
1  22 NaN NaN
2  33 NaN NaN
```

- In this example, `df1` and `df2` have different columns (`df1` has columns 'A' and 'B', and `df2` has columns 'A' and 'C'). When adding them (`df1 + df2`), Pandas aligns the data based on both indices and columns. Columns that exist in one DataFrame but not in the other (like column 'B' in `df1` and column 'C' in `df2`) result in NaN values in the resulting DataFrame (`df_sum`). This alignment ensures that operations are performed correctly and consistently across DataFrames with potentially different structures.
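
When NaN is not the desired result for labels that exist on only one side, the method form `DataFrame.add` accepts a `fill_value`. A minimal sketch with the same two DataFrames, assuming missing values should be treated as 0 (`df_sum_filled` is an illustrative name):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [10, 20, 30], 'C': [40, 50, 60]})

# Treat a value that is missing in one DataFrame as 0 instead of NaN
df_sum_filled = df1.add(df2, fill_value=0)
print(df_sum_filled)
```

Columns 'B' and 'C' now keep the values from the DataFrame that has them, instead of becoming NaN.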

### Conclusion:

The automatic alignment of data based on indices and column names in Pandas DataFrames simplifies complex data manipulations by ensuring that operations are performed correctly and efficiently. This behavior is one of the key advantages of using Pandas for data analysis and manipulation tasks, as it helps avoid manual alignment and potential errors in data handling.

### **Doubts :**

1. grouped_data = df.groupby('City')['Age'].mean()
2. What is the difference between df1.merge(df2) and df1 + df2 



#### Explain the statement `grouped_data = df.groupby('City')['Age'].mean()` in detail in below code

In [16]:
import pandas as pd
data = {
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago'],
    'Age': [25, 30, 28, 35, 32]
}

df = pd.DataFrame(data)

grouped_data = df.groupby('City')

print(df)
print(grouped_data)

# Applying groupby and mean
grouped_data = df.groupby('City')['Age'].mean()
print(grouped_data)

          City  Age
0     New York   25
1  Los Angeles   30
2      Chicago   28
3     New York   35
4      Chicago   32
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11e1f4520>
City
Chicago        30.0
Los Angeles    30.0
New York       30.0
Name: Age, dtype: float64


#### Explanation
- <mark>`df.groupby('City')`</mark> : <mark>returns only a GroupBy object</mark>; nothing is calculated yet. Printing it shows something like `<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11e1c6f50>` (the memory address differs on every run).
- `[]` is used to select columns from the grouped data, not `()`, just as on a DataFrame.
- <mark>`df.groupby('City')['Age']`</mark> : <mark>selects the Age column within each group</mark> from the GroupBy object. You can also select multiple columns by passing a list of column names inside the brackets.
- <mark>`df.groupby('City')['Age'].mean()`</mark> : applies `mean()` to the selected column, i.e. Age, separately for each city.
- If a single column is selected with `['Age']`, the aggregated result is a Pandas Series; if a list of columns is selected with `[[...]]`, the result is a Pandas DataFrame.
- In Pandas, you often **chain methods** together <mark>to perform operations sequentially on DataFrames or Series</mark>. `The dot operator (.) is used to chain these methods and attributes.` Here it signifies that `.mean()` is applied to the result of `df.groupby('City')['Age']`, which is a SeriesGroupBy object holding the 'Age' values of each group.
- This syntax correctly chains operations:
    - df.groupby('City'): Groups the DataFrame df by 'City'.
    - ['Age']: Selects the 'Age' column within each group.
    - .mean(): Calculates the mean of the 'Age' column within each group.
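
The Series-versus-DataFrame behaviour can be checked with a small sketch. The `Salary` column and its values are hypothetical and added only so that more than one column can be selected:

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago'],
    'Age': [25, 30, 28, 35, 32],
    'Salary': [50, 60, 55, 70, 65],   # hypothetical column for illustration
})

# One column selected with ['Age'] -> the result is a Series
mean_age = df.groupby('City')['Age'].mean()
print(type(mean_age).__name__)

# A list of columns selected with [[...]] -> the result is a DataFrame
mean_both = df.groupby('City')[['Age', 'Salary']].mean()
print(type(mean_both).__name__)

# Several aggregations on one column at once
age_stats = df.groupby('City')['Age'].agg(['mean', 'min', 'max'])
print(age_stats)
```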


#### Explain what is the difference between 'merged_df = pd.merge(df1, df2, on='ID')' and 'df1 + df2' in the below code.

In [21]:
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Laxman', 'Harshita', 'Naina']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Age': [30, 35, 40]})
merged_df = pd.merge(df1, df2, on='ID')

print(merged_df)
print('------------------')
print (df1 + df2)

   ID      Name  Age
0   2  Harshita   30
1   3     Naina   35
------------------
   Age  ID  Name
0  NaN   3   NaN
1  NaN   5   NaN
2  NaN   7   NaN


**Key Differences:**

`Operation Type`:
- <mark>`pd.merge(df1, df2, on='ID')`</mark>: Performs a database-style join operation, <mark>merging rows based on a common column ('ID')</mark>.
- <mark>`df1 + df2`</mark>: <mark>Performs element-wise addition</mark> after aligning the two DataFrames on row index and column names. It is a valid operation, but it gives meaningful results only <mark>where both the labels and the data types match</mark>.

`Resulting DataFrame:`
- `pd.merge(df1, df2, on='ID')` results in a new DataFrame (`merged_df`) with combined columns from both `df1` and `df2`, for the rows with matching 'ID' values.
- `df1 + df2` adds the 'ID' columns row by row (1+2, 2+3, 3+4), because the rows are matched on the default index 0, 1, 2 and not on the ID values. 'Name' and 'Age' each exist in only one DataFrame, so those columns are filled with NaN.

`Purpose and Use Case:`
- Use `pd.merge()` when you want to combine data based on a common column (like merging employee data with department data based on employee ID).
- <mark>Avoid `df1 + df2` for merging or joining purposes; it is meant for element-wise arithmetic on DataFrames (or Series) that share the same labels, `not for combining tables`.</mark>

#### How to set a name for a Pandas Series?

In [2]:
marks = [87, 98, 79, 91, 75]
sub = ['P', 'S', 'W', 'X', 'J']

pd.Series(marks, index = sub)
pd.Series(marks, index = sub, name = "Student marks")

P    87
S    98
W    79
X    91
J    75
Name: Student marks, dtype: int64

#### What are the different Pandas Series Attributes available ?

In [3]:
import pandas as pd
# Size -> how many values are in series

marks = {
    'p' : 83,
    's' : 89,
    'x' : 71,
    'y' : 71,
    'k' : 94
}

mark_series = pd.Series(marks, name = "your mark")

print(mark_series)

print('----------------------')

# size -> total element
mark_series.size
print('mark_series.size : ', mark_series.size)

# dtype
mark_series.dtype
print('mark_series.dtype : ', mark_series.dtype)

# name 
mark_series.name
print('mark_series.name : ', mark_series.name)

# is_unique (False if any value repeats)
mark_series.is_unique
pd.Series([1, 2, 1, 3, 4, 2, 4, 5]).is_unique
print('mark_series.is_unique : ', mark_series.is_unique)

# index (return all index)
mark_series.index
print('mark_series.index : ', mark_series.index)

# values
mark_series.values
print('mark_series.values : ', mark_series.values)

p    83
s    89
x    71
y    71
k    94
Name: your mark, dtype: int64
----------------------
mark_series.size :  5
mark_series.dtype :  int64
mark_series.name :  your mark
mark_series.is_unique :  False
mark_series.index :  Index(['p', 's', 'x', 'y', 'k'], dtype='object')
mark_series.values :  [83 89 71 71 94]


#### How to Convert a DataFrame with single column to Series ?

In [6]:
import pandas as pd  # Import pandas library

# Sample data creation (for illustration, normally this would be read from a CSV)
data = {'match_no': [1, 2, 3, 4, 5], 'runs': [45, 67, 23, 89, 34]}
kohli_data = pd.DataFrame(data).set_index('match_no')

print('# Display the DataFrame :')
print(kohli_data)

# Alternatively, reading from the CSV file
# kohli_data = pd.read_csv('/kaggle/input/dataset-for-pandas-series/kohli_ipl.csv', index_col="match_no")

# Convert the single-column DataFrame to a Series by squeezing the columns axis
vk = kohli_data.squeeze("columns")
print('--------------')

print('vk.size : ',vk.size)
print('vk.index : ',vk.index)
print('vk.values : ',vk.values)

# Check the type of vk to ensure it is a Series
print('type(vk) : ',type(vk))

print('--------------')
# Display the Series
print('# Display the Series')
print(vk)


# Display the DataFrame :
          runs
match_no      
1           45
2           67
3           23
4           89
5           34
--------------
vk.size :  5
vk.index :  Int64Index([1, 2, 3, 4, 5], dtype='int64', name='match_no')
vk.values :  [45 67 23 89 34]
type(vk) :  <class 'pandas.core.series.Series'>
--------------
# Display the Series
match_no
1    45
2    67
3    23
4    89
5    34
Name: runs, dtype: int64


#### Where `df` is a DataFrame, why do we filter with `filtered_data = df[df['Age'] > 30]` and not `filtered_data = df['Age' > 30]`, given that a column is accessed simply as `df['Age']`?


1. **Accessing a Column:**
   To access a single column in a DataFrame, you use the syntax `df['column_name']`. This returns a Series object representing the column:
   ```python
   age_column = df['Age']
   ```

2. **Filtering Rows Based on a Condition:**
   When you filter rows, you want to return a subset of the DataFrame where a certain condition is true. <mark>The syntax `df['Age'] > 30` creates a boolean Series that has the same index as `df`, where each value is `True` if the condition is met and `False` otherwise:</mark>
   ```python
   condition = df['Age'] > 30
   ```

3. **Applying the Filter:**
   <mark>To filter the DataFrame based on this condition, you use this boolean Series inside the DataFrame's indexing brackets. This tells `pandas` to return only the rows where the condition is `True`:</mark>
   ```python
   filtered_data = df[condition]
   ```

<mark>Putting it all together, the full expression `df[df['Age'] > 30]` can be broken down as:</mark>
- `df['Age']` accesses the 'Age' column.
- `df['Age'] > 30` creates a boolean Series where the condition is evaluated for each row.
- `df[boolean_series]` filters the DataFrame based on the boolean Series.

### <mark>Why not `df['Age'>30]`?</mark>
The syntax `df['Age'>30]` does not work for a couple of reasons:
1. **String Literal:** Python evaluates `'Age'>30` first, before pandas is involved. This compares the string `'Age'` with the number `30`, which is not supported in Python 3 and raises a `TypeError`.
2. **Indexing Syntax:** `df[something]` expects `something` to be a column name (to access a column), a list of column names, or a boolean Series (to filter rows). The expression `'Age'>30` is none of these.
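
The first point can be checked without any DataFrame at all. A minimal sketch (`result` is an illustrative name):

```python
# Python evaluates the expression inside the brackets first,
# so the comparison fails before pandas is involved at all
try:
    result = 'Age' > 30
except TypeError as error:
    result = None
    print('TypeError:', error)
```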

### Summary:
- `df['Age']` accesses the 'Age' column.
- `df['Age'] > 30` creates a boolean Series.
- `df[boolean_series]` filters the DataFrame rows based on the boolean Series.


#### What is a boolean series ?

A boolean Series in `pandas` is a Series object where each element is either `True` or `False`. It is typically created by applying a comparison or logical operation to a `pandas` Series. This boolean Series can then be used to filter data in a DataFrame.


**Example:**

Suppose we have the following DataFrame:

```python
import pandas as pd

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 32, 18, 45, 30],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
}
df = pd.DataFrame(data)
print(df)
```

Output:
```
      Name  Age         City
0    Alice   25     New York
1      Bob   32  Los Angeles
2  Charlie   18      Chicago
3    David   45      Houston
4      Eve   30      Phoenix
```

##### Creating a Boolean Series:

We can create a boolean Series by applying a condition to one of the columns. For instance, let's create a boolean Series to check which rows have 'Age' greater than 30:

```python
age_condition = df['Age'] > 30
print(age_condition)
```

Output:
```
0    False
1     True
2    False
3     True
4    False
Name: Age, dtype: bool
```

<mark>Here, `age_condition` is a boolean Series where each value indicates whether the corresponding row in the 'Age' column satisfies the condition `Age > 30`.</mark>

##### Using the Boolean Series to Filter Data:

We can use this boolean Series to filter the DataFrame and get only the rows where 'Age' is greater than 30:

```python
filtered_df = df[age_condition]
print(filtered_df)
```

Output:
```
    Name  Age         City
1    Bob   32  Los Angeles
3  David   45      Houston
```

##### Summary:
1. **Boolean Series Creation:** Applying a condition to a DataFrame column creates a boolean Series.
2. **Boolean Series Example:** In our example, `df['Age'] > 30` creates a boolean Series where each element is `True` if the condition is met, and `False` otherwise.
3. **Filtering Data:** This boolean Series can be used to filter the DataFrame, returning only the rows where the condition is `True`.

This process allows for powerful and flexible data manipulation in `pandas`.
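Boolean Series can also be combined before filtering. The sketch below rebuilds the sample DataFrame from above and joins two conditions with the element-wise operators `&` (and), `|` (or) and `~` (not). The variable names (`older_than_30`, `in_houston`, and so on) are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 32, 18, 45, 30],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
})

# Each comparison produces its own boolean Series
older_than_30 = df['Age'] > 30
in_houston = df['City'] == 'Houston'

# & = element-wise AND, | = element-wise OR, ~ = element-wise NOT
both = df[older_than_30 & in_houston]
either = df[older_than_30 | in_houston]
not_older = df[~older_than_30]

print(both['Name'].tolist())       # ['David']
print(either['Name'].tolist())     # ['Bob', 'David']
print(not_older['Name'].tolist())  # ['Alice', 'Charlie', 'Eve']
```

When the conditions are written inline, each comparison needs its own parentheses, for example `df[(df['Age'] > 30) & (df['City'] == 'Houston')]`, because `&` binds more tightly than `>` and `==`.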

#### What is `pd.Series()`?

- `pd.Series` is the Series class in the pandas library; calling `pd.Series(...)` constructs a Series object.
- A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).
- It is similar to a single column of a DataFrame, or to a Python list with an index attached.

#### Can we perform merge operation on three dataframes ?

```python
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Laxman', 'Harshita', 'Naina']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Age': [30, 35, 40]})
df3 = pd.DataFrame({'ID': [3, 4, 5], 'Sex': ['M','F','F']})
# merged_df = pd.merge(df1, df2, df3, on='ID')

# print(merged_df)
print('------------------')
print(df1 + df2 + df3)
```

Output:
```
------------------
   Age  ID  Name  Sex
0  NaN   6   NaN  NaN
1  NaN   9   NaN  NaN
2  NaN  12   NaN  NaN
```


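#### Note :
- `merged_df = pd.merge(df1, df2, df3, on='ID')` raises an error because `pd.merge()` joins exactly two DataFrames: its signature is `merge(left, right, how=..., on=..., ...)`, so `df3` is taken as the `how` argument.
- `df1 + df2 + df3` runs without error, but it is element-wise addition aligned on index and column labels, not a merge. Only `ID` exists in all three DataFrames, so its values are summed (1 + 2 + 3 = 6, and so on); every other column is missing from at least one DataFrame and becomes `NaN`.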
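Three DataFrames can still be merged by applying the two-way merge repeatedly. The sketch below shows two common ways to do it, reusing `df1`, `df2` and `df3` from above: chaining `.merge()` calls, and folding `pd.merge` over a list with `functools.reduce`. The names `chained`, `frames` and `outer` are illustrative.

```python
from functools import reduce

import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Laxman', 'Harshita', 'Naina']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Age': [30, 35, 40]})
df3 = pd.DataFrame({'ID': [3, 4, 5], 'Sex': ['M', 'F', 'F']})

# Option 1: chain two-way merges (inner join by default),
# which keeps only the IDs present in all three DataFrames
chained = df1.merge(df2, on='ID').merge(df3, on='ID')
print(chained)

# Option 2: fold pd.merge over a list of DataFrames;
# how='outer' keeps every ID and fills the gaps with NaN
frames = [df1, df2, df3]
outer = reduce(
    lambda left, right: pd.merge(left, right, on='ID', how='outer'),
    frames,
)
print(outer)
```

With the default inner join only `ID` 3 survives, since it is the only key shared by all three DataFrames. The `reduce` form scales to any number of DataFrames that share the key column.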

#### How to choose a particular column as index while reading a csv file ?

```python
import pandas as pd

# Sample data
data = {
    'Name': ['Laxman', 'Rajesh', 'Ram'],
    'Age': [25, 30, 22],
    'City': ['Pune', 'Hyderabad', 'Mahabubnagar']
}

# Create a DataFrame
df = pd.DataFrame(data)

# Save the DataFrame to a CSV file (without writing the index)
df.to_csv('data.csv', index=False)

# Read the CSV file back into a DataFrame
df_from_csv = pd.read_csv('data.csv')

# Print the DataFrame (it gets the default integer index 0, 1, 2)
print('df_from_csv : \n',df_from_csv)

# Read the CSV file back into a DataFrame and choose a particular column as Index
df_from_csv_with_index = pd.read_csv('data.csv', index_col='Age')

# Print the DataFrame
print('\n\n df_from_csv_with_index : \n',df_from_csv_with_index)
```

Output:
```
df_from_csv : 
      Name  Age          City
0  Laxman   25          Pune
1  Rajesh   30     Hyderabad
2     Ram   22  Mahabubnagar


 df_from_csv_with_index : 
        Name          City
Age                      
25   Laxman          Pune
30   Rajesh     Hyderabad
22      Ram  Mahabubnagar
```
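`index_col` also accepts a column position, and the same result can be reached after reading with `set_index()`. The sketch below reads the same three rows from an in-memory string (via `io.StringIO`, an assumption made here so the example does not depend on a file on disk) and shows that the three approaches agree.

```python
import io

import pandas as pd

# Same content as data.csv above, held in memory for the example
csv_text = (
    "Name,Age,City\n"
    "Laxman,25,Pune\n"
    "Rajesh,30,Hyderabad\n"
    "Ram,22,Mahabubnagar\n"
)

# 1. Choose the index column by name while reading
by_name = pd.read_csv(io.StringIO(csv_text), index_col='Name')

# 2. Choose the index column by position (0 = first column)
by_position = pd.read_csv(io.StringIO(csv_text), index_col=0)

# 3. Read normally, then promote a column to the index
after_read = pd.read_csv(io.StringIO(csv_text)).set_index('Name')

print(by_name)
print(by_name.loc['Ram', 'City'])   # label-based lookup through the new index
```

A column with unique values (such as `Name` here) usually makes a more useful index than `Age`, because each label then identifies exactly one row.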


### **`PANDAS SERIES METHODS`**


#### Subtopic: Normal Methods

| Method | Syntax | Return Type | Input Parameters | In-place or Copy | One Line Explainer | Peculiarities / Considerations |
|--------|--------|-------------|------------------|------------------|--------------------|-------------------------------|
| `head()` | `series.head(n=5)` | Series | `n` (int, default 5) | Copy | Returns the first `n` rows. | Useful for quickly viewing the top rows. |
| `tail()` | `series.tail(n=5)` | Series | `n` (int, default 5) | Copy | Returns the last `n` rows. | Useful for quickly viewing the bottom rows. |
| `sample()` | `series.sample(n=None, frac=None, random_state=None)` | Series | `n` (int, optional), `frac` (float, optional), `random_state` (int, optional) | Copy | <mark>**Returns a random sample of items.**</mark> | <mark>**Use `frac` for a fraction of items, `n` for a fixed number.**</mark> |
| `value_counts()` | `series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)` | Series | Various optional parameters | Copy | <mark>**Returns counts of unique values.**</mark> | <mark>**Useful for frequency analysis.**</mark> |
| `sort_values()` | `series.sort_values(ascending=True, inplace=False)` | Series | <mark>**`ascending` (bool, default True), `inplace` (bool, default False)**</mark> | <mark>Copy (or In-place) </mark>| Sorts values in the Series. | Can sort in ascending or descending order. |
| `sort_index()` | `series.sort_index(ascending=True, inplace=False)` | Series | <mark>`ascending` (bool, default True), `inplace` (bool, default False) </mark>| <mark>Copy (or In-place)</mark> | Sorts Series by index. | <mark>**Useful for reordering based on index.**</mark>|


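#### **`Note :`** 
1. All of these methods return a Series.
2. All of them return a new object by default and leave the original Series unchanged. `sort_values()` and `sort_index()` also accept `inplace=True` (default `False`), in which case the Series is modified directly and the method returns `None`.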
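The copy versus in-place distinction can be checked directly. The sketch below uses a small Series with a hypothetical letter index so the reordering is easy to see.

```python
import pandas as pd

series = pd.Series([4, 2, 7, 1], index=['d', 'b', 'c', 'a'])

# Default behaviour: a new sorted Series is returned, the original is untouched
sorted_copy = series.sort_values()
original_order = series.tolist()
print(original_order)        # [4, 2, 7, 1]
print(sorted_copy.tolist())  # [1, 2, 4, 7]

# inplace=True: the Series itself is reordered and the call returns None
result = series.sort_values(inplace=True)
print(result)                # None
print(series.tolist())       # [1, 2, 4, 7]
```

Because the in-place call returns `None`, writing `series = series.sort_values(inplace=True)` would replace the Series with `None`; use either reassignment or `inplace=True`, not both.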


```python
# Dataset
import pandas as pd

# Sample data
data = {
    'numbers': [4, 2, 7, 1, 8, 3, 2, 4, 7, 9, 6, 5]
}

# Create a Series
series = pd.Series(data['numbers'])
print("Original Series:\n", series)
```

Output:
```
Original Series:
 0     4
1     2
2     7
3     1
4     8
5     3
6     2
7     4
8     7
9     9
10    6
11    5
dtype: int64
```


#### Program demonstrating Normal Methods:

```python
print("Original Series:\n", series)

# Normal Methods
print("\nHead (first 5 elements):")
print(series.head())

print("\nTail (last 5 elements):")
print(series.tail())

print("\nRandom Sample (3 elements):")
print(series.sample(3))

print("\nValue Counts:")
print(series.value_counts())

print("\nSort Values (ascending):")
print(series.sort_values())

print("\nSort Values (descending):")
print(series.sort_values(ascending=False))

print("\nSort Values (descending) and get the highest value:")
print(series.sort_values(ascending=False).head(1).values[0])

print("\nSort Index:")
print(series.sort_index())

print("\nSort Index In-place:")
series.sort_index(inplace=True)
print(series)
```

Output:
```
Original Series:
 0     4
1     2
2     7
3     1
4     8
5     3
6     2
7     4
8     7
9     9
10    6
11    5
dtype: int64

Head (first 5 elements):
0    4
1    2
2    7
3    1
4    8
dtype: int64

Tail (last 5 elements):
7     4
8     7
9     9
10    6
11    5
dtype: int64

Random Sample (3 elements):
0    4
6    2
3    1
dtype: int64

Value Counts:
4    2
2    2
7    2
1    1
8    1
3    1
9    1
6    1
5    1
dtype: int64

Sort Values (ascending):
3     1
1     2
6     2
5     3
0     4
7     4
11    5
10    6
2     7
8     7
4     8
9     9
dtype: int64

Sort Values (descending):
9     9
4     8
2     7
8     7
10    6
11    5
0     4
7     4
5     3
1     2
6     2
3     1
dtype: int64

Sort Values (descending) and get the highest value:
9

Sort Index:
0     4
1     2
2     7
3     1
4     8
5     3
6     2
7     4
8     7
9     9
10    6
11    5
dtype: int64

Sort Index In-place:
0     4
1     2
2     7
3     1
4     8
5     3
6     2
7     4
8     7
9     9
10    6
11    5
dtype: int64
```
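The output of `sample()` changes from run to run unless `random_state` is fixed, and `value_counts()` can report proportions instead of raw counts. A minimal sketch on the same numbers (the seed values 42 and 0 are arbitrary choices):

```python
import pandas as pd

series = pd.Series([4, 2, 7, 1, 8, 3, 2, 4, 7, 9, 6, 5])

# The same random_state gives the same sample every time
first_draw = series.sample(n=3, random_state=42)
second_draw = series.sample(n=3, random_state=42)
print(first_draw.equals(second_draw))   # True

# frac takes a fraction of the items: 25% of 12 values is 3 values
quarter = series.sample(frac=0.25, random_state=0)
print(len(quarter))                     # 3

# normalize=True returns relative frequencies that add up to 1
shares = series.value_counts(normalize=True)
print(shares)
```

`n` and `frac` are alternatives; pass one or the other, not both.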

#### Subtopic: Math Methods

| Method | Syntax | Return Type | Input Parameters | In-place or Copy | One Line Explainer | Peculiarities / Considerations |
|--------|--------|-------------|------------------|------------------|--------------------|-------------------------------|
| `count()` | `series.count()` | int | None | Copy | <mark>**Returns number of non-NA/null values.**</mark> | <mark>**Useful for knowing non-missing entries.**</mark> |
| `sum()` | `series.sum()` | scalar | None | Copy | Returns the sum of values. | <mark>**Skips `NaN` values by default (`skipna=True`).**</mark>|
| `product()` | `series.product()` | scalar | None | Copy | Returns the product of values. | <mark>**Multiplies all values into one scalar; for a running product use `cumprod()`.**</mark> |





#### Program demonstrating Math Methods:

```python
print("Original Series:\n", series)

# Math Methods
print("\nCount of non-null values:")
print(series.count())

print("\nSum of values:")
print(series.sum())

print("\nProduct of values:")
print(series.product())
```

Output:
```
Original Series:
 0     4
1     2
2     7
3     1
4     8
5     3
6     2
7     4
8     7
9     9
10    6
11    5
dtype: int64

Count of non-null values:
12

Sum of values:
58

Product of values:
20321280
```
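The sample Series has no missing values, so it does not show how these methods treat `NaN`. The sketch below uses a small hypothetical Series, `with_gap`, that contains one missing entry.

```python
import numpy as np
import pandas as pd

with_gap = pd.Series([2.0, np.nan, 3.0, 4.0])

# count() counts only the non-missing entries
print(with_gap.count())              # 3

# sum() and product() skip NaN by default (skipna=True)
print(with_gap.sum())                # 9.0
print(with_gap.product())            # 24.0

# With skipna=False a single NaN makes the whole result NaN
print(with_gap.sum(skipna=False))    # nan

# cumprod() gives the running product and returns a Series, not a scalar
print(with_gap.cumprod().tolist())   # [2.0, nan, 6.0, 24.0]
```

Comparing `series.count()` with `len(series)` is a quick way to see how many values are missing.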


#### Subtopic: Statistical Methods

| Method | Syntax | Return Type | Input Parameters | In-place or Copy | One Line Explainer | Peculiarities / Considerations |
|--------|--------|-------------|------------------|------------------|--------------------|-------------------------------|
| `mean()` | `series.mean()` | scalar | None | Copy | Returns the mean of values. | <mark>**Skips `NaN` values by default (`skipna=True`).**</mark> |
| `median()` | `series.median()` | scalar | None | Copy | Returns the median of values. | Provides the middle value. |
| `mode()` | `series.mode()` | <mark>**Series**</mark> | None | Copy | Returns the mode(s) of values. | <mark>**Can return multiple modes.**</mark>|
| `std()` | `series.std()` | scalar | None | Copy | Returns the standard deviation. | <mark>**Measures data spread; sample standard deviation by default (`ddof=1`).**</mark> |
| `var()` | `series.var()` | scalar | None | Copy | Returns the variance. | <mark>Square of the standard deviation; also uses `ddof=1` by default.</mark> |
| `min()` | `series.min()` | scalar | None | Copy | Returns the minimum value. | Finds the smallest value. |
| `max()` | `series.max()` | scalar | None | Copy | Returns the maximum value. | Finds the largest value. |
| `describe()` | `series.describe()` | <mark>**Series**</mark> | None | Copy | <mark>Returns summary statistics. </mark>| <mark>Provides a quick overview.</mark> |

#### Note :
1. The return type of <mark>`both mode() and describe() is 'Series'`</mark>
2. <mark>Mode can return multiple values</mark>
3. <mark>Mean skips `NaN` (missing) values by default; non-numeric data such as strings is not silently ignored</mark>

#### Program demonstrating Statistical Methods:

```python
print("Original Series:\n", series)

# Statistical Methods
print("\nMean of values:")
print(series.mean())

print("\nMedian of values:")
print(series.median())

print("\nMode of values:")
print(series.mode())

print("\nStandard Deviation of values:")
print(series.std())

print("\nVariance of values:")
print(series.var())

print("\nMinimum value:")
print(series.min())

print("\nMaximum value:")
print(series.max())

print("\nDescribe:")
print(series.describe())
```

Output:
```
Original Series:
 0     4
1     2
2     7
3     1
4     8
5     3
6     2
7     4
8     7
9     9
10    6
11    5
dtype: int64

Mean of values:
4.833333333333333

Median of values:
4.5

Mode of values:
0    2
1    4
2    7
dtype: int64

Standard Deviation of values:
2.5878504008094625

Variance of values:
6.696969696969696

Minimum value:
1

Maximum value:
9

Describe:
count    12.000000
mean      4.833333
std       2.587850
min       1.000000
25%       2.750000
50%       4.500000
75%       7.000000
max       9.000000
dtype: float64
```
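Two details of the results above are worth checking in code: `mode()` returns a Series because several values can tie, and `std()` / `var()` compute the sample statistic (`ddof=1`) unless told otherwise. A minimal sketch on the same numbers (variable names are illustrative):

```python
import pandas as pd

series = pd.Series([4, 2, 7, 1, 8, 3, 2, 4, 7, 9, 6, 5])

# Three values (2, 4 and 7) each occur twice, so all three are modes
modes = series.mode()
print(modes.tolist())        # [2, 4, 7]

# Default: sample standard deviation (divides by n - 1)
sample_std = series.std()

# ddof=0: population standard deviation (divides by n)
population_std = series.std(ddof=0)
print(sample_std, population_std)

# Variance is the square of the standard deviation
print(series.var(), sample_std ** 2)
```

To get a single mode as a scalar, index the result, for example `series.mode()[0]`, which returns the smallest of the tied values because `mode()` sorts its output.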
