# pandas
pandas is a powerful data manipulation library for Python that provides data structures and functions for working with structured data. It is widely used for data analysis, cleaning, and transformation tasks.


#### Bash to install pandas

```bash       
pip install pandas
```    

#### Importing pandas 
```python     
import pandas as pd
```


In [3]:
! pip install pandas  
# In Google Colab, pandas is already installed, so this line is not necessary.
# You can comment it out or remove it if you're running the code in an environment where pandas is already available.

Defaulting to user installation because normal site-packages is not writeable


In [4]:
import pandas as pd 
pd.__version__ 


'2.3.3'

# Data Structures in pandas
---

The primary data structures in pandas are Series and DataFrame.
- A Series is a one-dimensional labeled array that can hold any data type. It is similar to a column in a spreadsheet or a database table.
- A DataFrame is a two-dimensional labeled data structure that can hold multiple Series. It is similar to a table in a spreadsheet or a database.


## Series

Important methods for Series:
- `pd.Series(data, index)`: Create a Series from a list, array, or dictionary. The `index` parameter is optional and can be used to specify custom labels for the Series.
- `series.head(n)`: Return the first `n` elements of the Series.             
- `series.tail(n)`: Return the last `n` elements of the Series.
- `series.describe()`: Generate descriptive statistics of the Series, including count, mean, standard deviation, minimum, and maximum values.
- `series.value_counts()`: Return a Series containing counts of unique values in the Series.      
- `series.unique()`: Return an array of unique values in the Series.
- `series.isnull()`: Return a boolean Series indicating which values are null (missing).   
- `series.notnull()`: Return a boolean Series indicating which values are not null (not missing).
- `series.fillna(value)`: Fill missing values in the Series with a specified value. 
- `series.dropna()`: Return a Series with missing values removed.     
- `series.astype(dtype)`: Convert the data type of the Series to a specified type.  
- `series.sort_values()`: Sort the values in the Series in ascending order.
- `series.sort_index()`: Sort the Series by its index labels.  
- `series.apply(func)`: Apply a function to each element in the Series.      
- `series.map(func)`: Map a function to each element in the Series, returning a new Series with the transformed values.
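
A few of the methods listed above can be tried on a small Series. This is a minimal sketch; the colour values and variable names are made up for illustration and do not come from the notebook's datasets.

```python
import numpy as np
import pandas as pd

# A small Series with one missing value (illustrative data).
s = pd.Series(["red", "blue", np.nan, "red", "green"])

counts = s.value_counts()     # counts per unique value; NaN is excluded by default
missing = s.isnull()          # boolean Series, True where the value is missing
filled = s.fillna("unknown")  # new Series with the gap filled
dropped = s.dropna()          # new Series without the missing entry
lengths = dropped.map(len)    # apply len() to every remaining element

print(counts)
print(missing.sum())          # number of missing values
print(filled.tolist())
print(lengths.tolist())
```

None of these calls modifies `s`; each one returns a new object.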


---

### pd.Series
The `pd.Series()` constructor creates a Series, which is a one-dimensional labeled array that can hold any data type. The syntax for creating a Series is as follows:

```python
pd.Series(data, index=None, dtype=None, name=None, copy=False)
```
- `data`: The data to be stored in the Series. It can be a list, array, dictionary, or scalar value.
- `index`: An optional parameter that specifies custom labels for the Series. If not provided, the default index will be a range of integers starting from 0.
- `dtype`: An optional parameter that specifies the data type of the Series. If not provided, pandas will infer the data type based on the input data.
- `name`: An optional parameter that assigns a name to the Series.
- `copy`: An optional parameter that indicates whether to copy the input data. The default is `False`. It only affects Series or one-dimensional NumPy array input.
- `fastpath`: An internal optimization parameter that also appears in the full signature. It is not meant for users, is deprecated in recent pandas releases, and should not be passed when creating a Series.


pd.Series() is a fundamental function in pandas that allows you to create a Series object, which is a powerful data structure for handling one-dimensional data with labels.
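
The constructor parameters described above can be illustrated with a short sketch. The names and numbers below are invented for the example.

```python
import pandas as pd

# From a list with custom index labels and a name.
scores = pd.Series([85, 90, 78], index=["Alice", "Bob", "Charlie"], name="score")

# From a dictionary: the keys become the index labels.
ages = pd.Series({"Alice": 25, "Bob": 30, "Charlie": 35}, name="age")

# Without an index, pandas assigns the default labels 0, 1, 2, ...
letters = pd.Series(["a", "b", "c"])

print(scores["Bob"])           # look up a value by its label
print(ages.index.tolist())
print(letters.index.tolist())
```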

## DataFrame 

Important methods for DataFrame:
- `pd.DataFrame(data, columns)`: Create a DataFrame from a list of lists, a dictionary of lists, or a NumPy array. The `columns` parameter is optional and can be used to specify custom column labels.
- `df.head(n)`: Return the first `n` rows of the DataFrame.    
- `df.tail(n)`: Return the last `n` rows of the DataFrame.
- `df.describe()`: Generate descriptive statistics of the DataFrame, including count, mean, standard deviation, minimum, and maximum values for each column.
- `df.info()`: Print a concise summary of the DataFrame, including the number of non-null values and data types of each column.
- `df.value_counts()`: Return a Series containing the frequency of each distinct row (each unique combination of column values) in the DataFrame.
- `df.isnull()`: Return a DataFrame of boolean values indicating which values are null (missing).
- `df.notnull()`: Return a DataFrame of boolean values indicating which values are not null (not missing).
- `df.fillna(value)`: Fill missing values in the DataFrame with a specified value.
- `df.dropna()`: Return a DataFrame with missing values removed.      
- `df.astype(dtype)`: Convert the data type of the DataFrame to a specified type.
- `df.sort_values(by)`: Sort the DataFrame by the values in a specified column.
- `df.sort_index()`: Sort the DataFrame by its index labels.          
- `df.apply(func)`: Apply a function to each column or row in the DataFrame.
- `df.map(func)`: Map a function to each element in the DataFrame, returning a new DataFrame with the transformed values.     
- `df.groupby(by)`: Group the DataFrame by a specified column and perform aggregate operations on the groups.   
- `df.merge(other, on)`: Merge the DataFrame with another DataFrame based on a common column.     
- `pd.concat([df1, df2])`: Concatenate two or more DataFrames along a specified axis (rows or columns). Note that `concat` is a top-level pandas function, not a DataFrame method.
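
The grouping and combining operations at the end of this list are easier to follow with a concrete case. The sketch below uses two small made-up tables (`sales` and `stores` are hypothetical names, not part of the notebook's data).

```python
import pandas as pd

# Illustrative data: unit sales per store and a lookup table of store cities.
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "item": ["pen", "ink", "pen", "pen"],
    "units": [10, 5, 7, 3],
})
stores = pd.DataFrame({"store": ["A", "B"], "city": ["Oslo", "Lima"]})

totals = sales.groupby("store")["units"].sum()          # total units per store
merged = sales.merge(stores, on="store")                # add the city to each sale
stacked = pd.concat([sales, sales], ignore_index=True)  # stack rows of two frames
row_counts = sales[["store", "item"]].value_counts()    # frequency of each (store, item) pair

print(totals)
print(merged)
print(stacked.shape)
print(row_counts)
```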



---

### pd.DataFrame
> The `pd.DataFrame()` constructor creates a DataFrame, which is a two-dimensional labeled data structure. You can create a DataFrame from various data sources, such as lists, dictionaries, or NumPy arrays. The syntax for creating a DataFrame is as follows:

```python
pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)
```

Where:
- `data`: The data to be stored in the DataFrame. It can be a list, dictionary, NumPy array, or another DataFrame.
- `index`: Optional. The index (row labels) for the DataFrame. If not provided, it will default to a range of integers starting from 0.
- `columns`: Optional. The column labels for the DataFrame. If not provided, the labels are taken from the data when it carries them (for example, dictionary keys); otherwise they default to a range of integers starting from 0.
- `dtype`: Optional. The data type for the DataFrame. If not provided, it will infer the data type based on the input data.
- `copy`: Optional. Whether to copy the input data. With the default `None`, dictionary input is copied, while DataFrame or 2D NumPy array input is not copied.


In [5]:
# For example, you can create a DataFrame from a dictionary like this:


data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)     
print(df)


import numpy as np

print("=============================================================================================\n")

print(np.random.rand(5)) 
# this will generate an array of 5 random numbers between 0 and 1.
print(np.random.rand(5, 3)) 
# this will generate a 5x3 array of random numbers between 0 and 1. 


print("=============================================================================================\n")
another_df = pd.DataFrame(
       np.random.rand(5, 3), 
       columns=['A', 'B', 'C']   , 
       index=['Row1', 'Row2', 'Row3', 'Row4', 'Row5'] 
)
print(another_df)


print("\n=============================================================================================\n")
another_df_oo1 = pd.DataFrame(
       np.random.rand(100, 3),       
       columns= ['A', 'B', 'C']   ,  
       index=['Row' + str(i) for i in range(1, 101)] 
)
print(another_df_oo1)



      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
[0.39463651 0.86544034 0.90464425 0.69269842 0.73749445]
[[0.54815413 0.83272986 0.42973423]
 [0.57202339 0.28547086 0.33570396]
 [0.291345   0.98882333 0.75460012]
 [0.88531799 0.7065427  0.83908782]
 [0.70028802 0.50494324 0.27428038]]
             A         B         C
Row1  0.197756  0.981331  0.430920
Row2  0.130629  0.277852  0.241414
Row3  0.758583  0.852297  0.608803
Row4  0.181559  0.279305  0.054162
Row5  0.249192  0.385619  0.796207

               A         B         C
Row1    0.793116  0.085455  0.703747
Row2    0.898628  0.609830  0.723686
Row3    0.640356  0.942022  0.530928
Row4    0.374699  0.243800  0.840748
Row5    0.801885  0.922816  0.333638
...          ...       ...       ...
Row96   0.354968  0.096967  0.498116
Row97   0.456906  0.881789  0.648464
Row98   0.938974  0.435014  0.874090
Row99   0.211206  0.774659  0.879940
Row100  0.906547  0.115935  0.025173

[100 rows x 3 columns]


In [6]:
from sklearn.datasets import load_diabetes 
# we are importing the diabetes dataset from sklearn, which is a commonly used dataset for regression tasks.
# The load_diabetes function returns a dictionary-like object that contains the data and target values. 


diabetes = load_diabetes() 
# this returns a Bunch (a dictionary-like object); its .data attribute is a NumPy array containing the features of the diabetes dataset.
print("Type of diabetes: " + str(type(diabetes)))
print("Type of diabetes.data: " + str(type(diabetes.data)))    



diabetes_for_panda = load_diabetes(as_frame=True)
# With as_frame=True, the function still returns a Bunch, but its .data attribute is a pandas DataFrame instead of a NumPy array.
print("Type of diabetes_for_panda: " + str(type(diabetes_for_panda)))
print("Type of diabetes_for_panda.data: " + str(type(diabetes_for_panda.data)))
 

print("\n=============================================================================================\n")
print("\n==== diabetes.data ====\n")
print("===============================================================================================\n")

print(diabetes.data ) 




diabetes_df = diabetes_for_panda.data
print("\n=============================================================================================")
print("\n==== diabetes_for_panda.data ====\n")
print("===============================================================================================\n")
print(diabetes_df)

print("\n=============================================================================================")
print("\n==== directly typing the diabetes_df ====\n")
print("===============================================================================================\n")
diabetes_df

Type of diabetes: <class 'sklearn.utils._bunch.Bunch'>
Type of diabetes.data: <class 'numpy.ndarray'>
Type of diabetes_for_panda: <class 'sklearn.utils._bunch.Bunch'>
Type of diabetes_for_panda.data: <class 'pandas.core.frame.DataFrame'>



==== diabetes.data ====


[[ 0.03807591  0.05068012  0.06169621 ... -0.00259226  0.01990749
  -0.01764613]
 [-0.00188202 -0.04464164 -0.05147406 ... -0.03949338 -0.06833155
  -0.09220405]
 [ 0.08529891  0.05068012  0.04445121 ... -0.00259226  0.00286131
  -0.02593034]
 ...
 [ 0.04170844  0.05068012 -0.01590626 ... -0.01107952 -0.04688253
   0.01549073]
 [-0.04547248 -0.04464164  0.03906215 ...  0.02655962  0.04452873
  -0.02593034]
 [-0.04547248 -0.04464164 -0.0730303  ... -0.03949338 -0.00422151
   0.00306441]]


==== diabetes_for_panda.data ====


          age       sex       bmi        bp        s1        s2        s3  \
0    0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401   
1   -0.001882 -0.044642 -0.051474 -0.026328 -0.00

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,0.038076,0.050680,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204
2,0.085299,0.050680,0.044451,-0.005670,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.025930
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641
...,...,...,...,...,...,...,...,...,...,...
437,0.041708,0.050680,0.019662,0.059744,-0.005697,-0.002566,-0.028674,-0.002592,0.031193,0.007207
438,-0.005515,0.050680,-0.015906,-0.067642,0.049341,0.079165,-0.028674,0.034309,-0.018114,0.044485
439,0.041708,0.050680,-0.015906,0.017293,-0.037344,-0.013840,-0.024993,-0.011080,-0.046883,0.015491
440,-0.045472,-0.044642,0.039062,0.001215,0.016318,0.015283,-0.028674,0.026560,0.044529,-0.025930


## Common DataFrame / Series Operations


---


### rename() function in pandas
> The `rename()` function in pandas is used to rename the labels of rows or columns in a DataFrame. You can use it to change the names of columns or index labels. The syntax for the `rename()` function is as follows:

```python
dataframe.rename(mapper=None, index=None, columns=None, axis=None, inplace=False)
```     

Where:
- `mapper`: A dictionary or function to map old labels to new labels.
- `index`: A dictionary or function to map old index labels to new index labels.
- `columns`: A dictionary or function to map old column labels to new column labels.
- `axis`: The axis along which to rename. Use `0` for index and `1` for columns. The default is `None`.
- `inplace`: If `True`, the operation will be performed in place and the original DataFrame will be modified. If `False`, a new DataFrame with the renamed labels will be returned. The default is `False`.       
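
As a minimal sketch of these parameters, the example below renames columns and index labels on a tiny made-up DataFrame (the labels `alpha`, `beta`, and `first` are arbitrary).

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["x", "y"])

# Rename columns with a dictionary; the original DataFrame is left unchanged.
renamed_cols = df.rename(columns={"A": "alpha", "B": "beta"})

# Rename a single index label; labels missing from the mapping stay as they are.
renamed_rows = df.rename(index={"x": "first"})

# Pass a function as the mapper and pick the axis explicitly.
lowered = df.rename(str.lower, axis=1)

print(renamed_cols.columns.tolist())
print(renamed_rows.index.tolist())
print(lowered.columns.tolist())
print(df.columns.tolist())  # still the original labels, because inplace=False
```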



### shape 
> The `shape` attribute in pandas is used to get the dimensions of a DataFrame or Series. It returns a tuple representing the number of rows and columns in a DataFrame, or the number of elements in a Series. The syntax for using the `shape` attribute is as follows:

```python     
dataframe.shape
```     

Where `dataframe` is the DataFrame or Series for which you want to get the dimensions. For a DataFrame, the output is a two-element tuple `(rows, columns)`. For a Series, the output is a one-element tuple `(n,)` holding the number of elements.
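
A short illustration of both cases, using a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

print(df.shape)        # (3, 2): three rows, two columns
print(df["A"].shape)   # (3,): a Series has a single dimension

# The tuple can be unpacked directly.
n_rows, n_cols = df.shape
```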


### head() and tail() 




> head() function in pandas

The `head()` method returns the first few rows of a DataFrame (or the first few elements of a Series). By default, it returns the first 5, but you can specify a different number by passing an integer as an argument.

```python
series.head()  # Displays the first 5 elements of the Series
series.head(10)  # Displays the first 10 elements of the Series
series.head(n)  # Displays the first n elements of the Series, where n is an integer  

dataframe.head()  # Displays the first 5 rows of the DataFrame
dataframe.head(10)  # Displays the first 10 rows of the DataFrame
dataframe.head(n)  # Displays the first n rows of the DataFrame, where n is an integer
```

> tail() function in pandas

The `tail()` method returns the last few rows of a DataFrame (or the last few elements of a Series). By default, it returns the last 5, but you can specify a different number by passing an integer as an argument.

```python   
series.tail()  # Displays the last 5 elements of the Series
series.tail(10)  # Displays the last 10 elements of the Series 
series.tail(n)  # Displays the last n elements of the Series, where n is an integer

dataframe.tail()  # Displays the last 5 rows of the DataFrame
dataframe.tail(10)  # Displays the last 10 rows of the DataFrame
dataframe.tail(n)  # Displays the last n rows of the DataFrame, where n is an integer
```    

---

In [7]:
print(diabetes_df.head(5) )
print(diabetes_df.tail(5))

        age       sex       bmi        bp        s1        s2        s3  \
0  0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401   
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412   
2  0.085299  0.050680  0.044451 -0.005670 -0.045599 -0.034194 -0.032356   
3 -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038   
4  0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142   

         s4        s5        s6  
0 -0.002592  0.019907 -0.017646  
1 -0.039493 -0.068332 -0.092204  
2 -0.002592  0.002861 -0.025930  
3  0.034309  0.022688 -0.009362  
4 -0.002592 -0.031988 -0.046641  
          age       sex       bmi        bp        s1        s2        s3  \
437  0.041708  0.050680  0.019662  0.059744 -0.005697 -0.002566 -0.028674   
438 -0.005515  0.050680 -0.015906 -0.067642  0.049341  0.079165 -0.028674   
439  0.041708  0.050680 -0.015906  0.017293 -0.037344 -0.013840 -0.024993   
440 -0.045472 -0.044642  0.039062  0.

### info() 

The `info()` method in pandas prints a concise summary of a DataFrame. It provides information about the DataFrame's structure, including the number of non-null entries, data types of each column, and memory usage.

When you call `dataframe.info()`, it will output a summary that includes:    
- The number of entries (rows) in the DataFrame.
- The number of columns in the DataFrame.
- The number of non-null entries in each column.
- The data type of each column (e.g., int64, float64, object).
- The memory usage of the DataFrame.

This function is particularly useful for quickly understanding the structure of your data and identifying any missing values or data type issues.  




---

In [8]:
print("\n=============================================================================================")
print("==== output of diabetes_df.info() ====")
print("===============================================================================================\n")
diabetes_df.info()
# info() prints its summary directly and returns None, so wrapping it in print() would add a stray "None" line.




==== output of diabetes_df.info() ====

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     442 non-null    float64
 1   sex     442 non-null    float64
 2   bmi     442 non-null    float64
 3   bp      442 non-null    float64
 4   s1      442 non-null    float64
 5   s2      442 non-null    float64
 6   s3      442 non-null    float64
 7   s4      442 non-null    float64
 8   s5      442 non-null    float64
 9   s6      442 non-null    float64
dtypes: float64(10)
memory usage: 34.7 KB


### describe() 
> describe() function in pandas is used to generate descriptive statistics of a DataFrame. It provides a summary of the central tendency, dispersion, and shape of the dataset's distribution, excluding NaN values.


When you call `dataframe.describe()`, it will output a summary that includes:    
- Count: The number of non-null entries in each column.
- Mean: The average value of each column.
- Standard Deviation (std): A measure of the amount of variation or dispersion in the data.
- Minimum (min): The smallest value in each column.   
- 25th Percentile (25%): The value below which 25% of the data falls.
- 50th Percentile (50%): The median value of each column.
- 75th Percentile (75%): The value below which 75% of the data falls.
- Maximum (max): The largest value in each column. 

This function is useful for quickly understanding the distribution and characteristics of your data, especially for numerical columns. It can help you identify outliers, understand the range of values, and get a sense of the overall distribution of the data. 

`dataframe.describe(percentiles=[0.1, 0.5, 0.9])`
> You can also specify custom percentiles by passing a list of values to the `percentiles` parameter. For example, `dataframe.describe(percentiles=[0.1, 0.5, 0.9])` will include the 10th, 50th (median), and 90th percentiles in the output.


`dataframe.describe(include='all')`
> You can also use `dataframe.describe(include='all')` to include all columns, including non-numeric ones, in the summary statistics. This will provide additional information such as the number of unique values, the most frequent value (top), and the frequency of the most frequent value (freq) for non-numeric columns.


When you call `dataframe.describe(include='all')`, it will output a summary that includes:
- For numeric columns: count, mean, std, min, 25%, 50%, 75%, and max.
- For non-numeric columns: count, unique (number of unique values), top (most frequent value), and freq (frequency of the most frequent value).    

This extended version of the describe function is particularly useful when you have a mix of numeric and categorical data in your DataFrame, as it provides insights into both types of data. 
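
Because the diabetes dataset contains only numeric columns, the effect of `include='all'` is easier to see on a mixed DataFrame. The sketch below uses a small made-up table (`people` is a hypothetical name) with one text column and one numeric column.

```python
import pandas as pd

# Illustrative data with a categorical column and a numeric column.
people = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Bob"],
    "age": [25, 30, 35, 30],
})

numeric_only = people.describe()             # only the numeric column "age"
everything = people.describe(include="all")  # both columns, with NaN where a statistic does not apply

print(numeric_only)
print(everything)
```

In the second result, rows such as `unique`, `top`, and `freq` are filled only for `name`, while rows such as `mean` and `std` are filled only for `age`.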





In [9]:

print("\n=============================================================================================")
print("==== output of diabetes_df.describe() ====")
print("===============================================================================================\n")
print(diabetes_df.describe())

print("\n=============================================================================================")
print("==== output of diabetes_df.describe(percentiles=[0.1, 0.5, 0.9]) ====")
print("===============================================================================================\n")
print(diabetes_df.describe(percentiles=[0.1, 0.5, 0.9]))


print("\n=============================================================================================")
print("==== output of diabetes_df.describe(include='all') ====")
print("===============================================================================================\n")
print(diabetes_df.describe(include='all'))



==== output of diabetes_df.describe() ====

                age           sex           bmi            bp            s1  \
count  4.420000e+02  4.420000e+02  4.420000e+02  4.420000e+02  4.420000e+02   
mean  -2.511817e-19  1.230790e-17 -2.245564e-16 -4.797570e-17 -1.381499e-17   
std    4.761905e-02  4.761905e-02  4.761905e-02  4.761905e-02  4.761905e-02   
min   -1.072256e-01 -4.464164e-02 -9.027530e-02 -1.123988e-01 -1.267807e-01   
25%   -3.729927e-02 -4.464164e-02 -3.422907e-02 -3.665608e-02 -3.424784e-02   
50%    5.383060e-03 -4.464164e-02 -7.283766e-03 -5.670422e-03 -4.320866e-03   
75%    3.807591e-02  5.068012e-02  3.124802e-02  3.564379e-02  2.835801e-02   
max    1.107267e-01  5.068012e-02  1.705552e-01  1.320436e-01  1.539137e-01   

                 s2            s3            s4            s5            s6  
count  4.420000e+02  4.420000e+02  4.420000e+02  4.420000e+02  4.420000e+02  
mean   3.918434e-17 -5.777179e-18 -9.042540e-18  9.268604e-17  1.130318e-17  
std    4.

### Transpose of a DataFrame
> The transpose of a DataFrame in pandas is obtained using the `transpose()` method or the `.T` attribute. Transposing a DataFrame means swapping its rows and columns. 

When you transpose a DataFrame, the rows become columns and the columns become rows. This can be useful for various reasons, such as changing the orientation of the data for better visualization or analysis.



In [10]:
print("\n=============================================================================================")
print("==== output of diabetes_df.T ====")
print("===============================================================================================\n")
print(diabetes_df.T)




==== output of diabetes_df.T ====

          0         1         2         3         4         5         6    \
age  0.038076 -0.001882  0.085299 -0.089063  0.005383 -0.092695 -0.045472   
sex  0.050680 -0.044642  0.050680 -0.044642 -0.044642 -0.044642  0.050680   
bmi  0.061696 -0.051474  0.044451 -0.011595 -0.036385 -0.040696 -0.047163   
bp   0.021872 -0.026328 -0.005670 -0.036656  0.021872 -0.019442 -0.015999   
s1  -0.044223 -0.008449 -0.045599  0.012191  0.003935 -0.068991 -0.040096   
s2  -0.034821 -0.019163 -0.034194  0.024991  0.015596 -0.079288 -0.024800   
s3  -0.043401  0.074412 -0.032356 -0.036038  0.008142  0.041277  0.000779   
s4  -0.002592 -0.039493 -0.002592  0.034309 -0.002592 -0.076395 -0.039493   
s5   0.019907 -0.068332  0.002861  0.022688 -0.031988 -0.041176 -0.062917   
s6  -0.017646 -0.092204 -0.025930 -0.009362 -0.046641 -0.096346 -0.038357   

          7         8         9    ...       432       433       434  \
age  0.063504  0.041708 -0.070900  ...  0.00


### columns attribute
> The `columns` attribute in pandas is used to access or modify the column labels of a DataFrame. It returns an Index object containing the column names of the DataFrame. You can also assign a new list of column names to this attribute to rename the columns of the DataFrame.     
> For example, if you have a DataFrame `df` and you want to rename its columns, you can do so by assigning a new list of column names to the `columns` attribute:

```python     
df.columns  # This will return the current column names of the DataFrame
df.columns = ['new_col1', 'new_col2', 'new_col3']
df.columns  # This will now return the new column names
```


In [11]:

diabetes_df["age"] # this will return the "age" column of the diabetes_df DataFrame as a pandas Series. 
diabetes_df["age"].head(5) # this will return the first 5 values of the "age" column. 
# diabetes_df["age", "sex"] 
# this will return an error because the correct way to select multiple columns is to use a list of column names.     
diabetes_df[["age", "sex"]] # this will return the "age" and "sex" columns of the diabetes_df DataFrame as a new DataFrame.     
       


Unnamed: 0,age,sex
0,0.038076,0.050680
1,-0.001882,-0.044642
2,0.085299,0.050680
3,-0.089063,-0.044642
4,0.005383,-0.044642
...,...,...
437,0.041708,0.050680
438,-0.005515,0.050680
439,0.041708,0.050680
440,-0.045472,-0.044642


# Data selection in pandas
In pandas, you can select data from a DataFrame using various methods, including `.loc`, `.iloc`, and boolean indexing. These methods allow you to access specific rows and columns based on labels, integer positions, or conditions. 




## .loc and .iloc 
> In pandas, `.loc` and `.iloc` are two different indexers used for selecting data from a DataFrame. They are accessed with square brackets rather than called like methods.
>
> - `.loc` is label-based, meaning that you use the labels of the rows and columns to select data. It allows you to select data based on the index labels and column names.
> - `.iloc` is integer position-based, meaning that you use the integer positions of the rows and columns to select data. It allows you to select data based on the numerical index of the rows and columns.      

### .loc
The `.loc` indexer is used for label-based indexing. You can use it to select rows and columns by their labels, and label slices include both endpoints. The syntax for using `.loc` is as follows:

```python     
dataframe.loc[row_labels, column_labels]
```           



### .iloc
The `.iloc` indexer is used for integer position-based indexing. You can use it to select rows and columns by their integer positions, and slices exclude the end position as in standard Python slicing. The syntax for using `.iloc` is as follows:

```python     
dataframe.iloc[row_positions, column_positions]
```    
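
The difference between the two indexers is clearest when the index labels are not integers. The following sketch assumes a small made-up DataFrame with string labels `a` to `d`.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [25, 30, 35, 40], "city": ["Oslo", "Lima", "Rome", "Cairo"]},
    index=["a", "b", "c", "d"],
)

# Label-based: the slice "a":"c" includes the end label, so three rows come back.
by_label = df.loc["a":"c", ["age"]]

# Position-based: the slice 0:2 excludes position 2, so two rows come back.
by_position = df.iloc[0:2, [0]]

# A single cell can be addressed either way.
city_by_label = df.loc["b", "city"]
city_by_position = df.iloc[1, 1]

print(by_label)
print(by_position)
print(city_by_label, city_by_position)
```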




---

In [12]:

print("\n=============================================================================================")
print("==== output of diabetes_df.loc[0] ====")
print("===============================================================================================\n")
print(diabetes_df.loc[0])
# this will return the first row of the diabetes_df DataFrame as a pandas Series.


print("\n=============================================================================================")
print("==== output of diabetes_df.loc[0:5] ====")
print("===============================================================================================\n")
print(diabetes_df.loc[0:5])
# this will return the rows with index labels 0 to 5 (inclusive) of the diabetes_df DataFrame as a new DataFrame.


print("\n=============================================================================================")
print("==== output of diabetes_df.loc[0:5, ['age', 'sex']] ====")
print("===============================================================================================\n")
print(diabetes_df.loc[0:5, ["age", "sex"]])
# this will return the rows with index labels 0 to 5 (inclusive) and the columns "age" and "sex" from the diabetes_df DataFrame.
# The .loc indexer is used to select data by label, and it includes the end label in the selection.


print("\n=============================================================================================")
print("==== output of diabetes_df.iloc[0] ====")
print("===============================================================================================\n")
print(diabetes_df.iloc[0])
# this will return the first row of the diabetes_df DataFrame as a pandas Series.


print("\n=============================================================================================")
print("==== output of diabetes_df.iloc[0:5] ====")
print("===============================================================================================\n")
print(diabetes_df.iloc[0:5])
# this will return the rows at positions 0 to 4 (inclusive) of the diabetes_df DataFrame as a new DataFrame.

print("\n=============================================================================================")
print("==== output of diabetes_df.iloc[0:5, [0, 1]] ====")
print("===============================================================================================\n")
print(diabetes_df.iloc[0:5, [0, 1]])
# this will return the rows at positions 0 to 4 (inclusive) and the columns at positions 0 and 1 (which correspond to "age" and "sex") from the diabetes_df DataFrame.
# The .iloc indexer is used to select data by integer position, and it does not include the end position in the selection.





==== output of diabetes_df.loc[0] ====

age    0.038076
sex    0.050680
bmi    0.061696
bp     0.021872
s1    -0.044223
s2    -0.034821
s3    -0.043401
s4    -0.002592
s5     0.019907
s6    -0.017646
Name: 0, dtype: float64

==== output of diabetes_df.loc[0:5] ====

        age       sex       bmi        bp        s1        s2        s3  \
0  0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401   
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412   
2  0.085299  0.050680  0.044451 -0.005670 -0.045599 -0.034194 -0.032356   
3 -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038   
4  0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142   
5 -0.092695 -0.044642 -0.040696 -0.019442 -0.068991 -0.079288  0.041277   

         s4        s5        s6  
0 -0.002592  0.019907 -0.017646  
1 -0.039493 -0.068332 -0.092204  
2 -0.002592  0.002861 -0.025930  
3  0.034309  0.022688 -0.009362  
4 -0.002592 -0.031988 -0.046641  
5 


# Conditional selection in pandas
> Conditional selection in pandas allows you to filter rows of a DataFrame based on specific conditions. You can use boolean indexing to achieve this. The syntax for conditional selection is as follows:

```python     
dataframe[condition]
```     

Where `condition` is a boolean expression that evaluates to True or False for each row in the DataFrame. For example, if you want to select rows where the value in the 'age' column is greater than 30, you can do it like this:

```python     
filtered_data = dataframe[dataframe['age'] > 30]
```
This will return a new DataFrame containing only the rows where the condition is met. You can also combine multiple conditions using logical operators (e.g., `&` for AND, `|` for OR) to filter data based on multiple criteria. For example:

```python     
filtered_data = dataframe[(dataframe['age'] > 30) & (dataframe['gender'] == 'female')]
```    
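This will return a new DataFrame containing only the rows where the age is greater than 30 and the gender is female. Each condition must be wrapped in parentheses, because `&` and `|` bind more tightly than comparison operators.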
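
The snippets above refer to a generic `dataframe`. A self-contained version, assuming a small made-up table with `age` and `gender` columns, might look like this:

```python
import pandas as pd

# Illustrative data only; the names and values are invented.
people = pd.DataFrame({
    "name": ["Ana", "Ben", "Cleo", "Dev"],
    "age": [28, 41, 35, 19],
    "gender": ["female", "male", "female", "male"],
})

# Single condition.
older = people[people["age"] > 30]

# AND: both conditions must hold.
older_women = people[(people["age"] > 30) & (people["gender"] == "female")]

# OR: at least one condition must hold.
young_or_male = people[(people["age"] < 20) | (people["gender"] == "male")]

# NOT: invert a boolean mask with ~.
not_female = people[~(people["gender"] == "female")]

print(older["name"].tolist())
print(older_women["name"].tolist())
print(young_or_male["name"].tolist())
print(not_female["name"].tolist())
```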





In [13]:
selector = diabetes_df["age"] > 0.05
print("\n=============================================================================================\n")
print(diabetes_df[selector])
# this will return a new DataFrame that contains only the rows where the value in the "age" column is greater than 0.05.







          age       sex       bmi        bp        s1        s2        s3  \
2    0.085299  0.050680  0.044451 -0.005670 -0.045599 -0.034194 -0.032356   
7    0.063504  0.050680 -0.001895  0.066629  0.090620  0.108914  0.022869   
17   0.070769  0.050680  0.012117  0.056301  0.034206  0.049416 -0.039719   
28   0.052606 -0.044642 -0.021295 -0.074527 -0.040096 -0.037639 -0.006584   
29   0.067136  0.050680 -0.006206  0.063187 -0.042848 -0.095885  0.052322   
..        ...       ...       ...       ...       ...       ...       ...   
402  0.110727  0.050680 -0.033151 -0.022885 -0.004321  0.020293 -0.061809   
408  0.063504 -0.044642 -0.050396  0.107944  0.031454  0.019354 -0.017629   
412  0.074401 -0.044642  0.085408  0.063187  0.014942  0.013091  0.015505   
414  0.081666  0.050680  0.006728 -0.004534  0.109883  0.117056 -0.032356   
431  0.070769  0.050680 -0.030996  0.021872 -0.037344 -0.047034  0.033914   

           s4        s5        s6  
2   -0.002592  0.002861 -0.025930  
7 

# apply() & map() functions 

apply() and map() are two powerful functions in pandas that allow you to apply a function to each element in a Series or DataFrame.  



---

### apply()

The `apply()` function applies a function along an axis of a DataFrame (to each column with `axis=0`, or to each row with `axis=1`) or to each element of a Series. For example, passing a function that computes the mean with `axis=0` returns the mean of each column. The general syntax is:

```python
series.apply(func)  # Apply a function to each element in the Series
dataframe.apply(func, axis=0)  # Apply a function to each column (axis=0)
dataframe.apply(func, axis=1)  # Apply a function to each row (axis=1)
```
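
To make the difference between the two axes concrete, here is a small sketch. The `scores` DataFrame is a hypothetical example, not data from this notebook.

```python
import pandas as pd

# Hypothetical exam scores, used only for illustration
scores = pd.DataFrame(
    {"math": [80, 90, 70], "physics": [60, 75, 90]},
    index=["Alice", "Bob", "Charlie"],
)

# axis=0 (default): the function receives each column as a Series
column_means = scores.apply(lambda col: col.mean(), axis=0)

# axis=1: the function receives each row as a Series
row_ranges = scores.apply(lambda row: row.max() - row.min(), axis=1)

print(column_means)
print(row_ranges)
```

For simple statistics such as the mean, the built-in methods (for example `scores.mean()`) are usually faster than `apply()`; `apply()` is most useful when no built-in method does what you need.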


In [14]:
import pandas as pd  
import numpy as np 
from IPython.display import display  # this is used to display the DataFrame in a more readable format in Jupyter notebooks.

students = pd.Series(data=[85, 90, 78, 92, 88],
                        index=['Alice', 'Bob', 'Charlie', 'David', 'Eve'],   
                        name='Scores'   )

display(students)
print("Datatype of students: " + str(type(students)))       
   
print("=============================================================================================\n")

display(students.apply(lambda x: x + 5))  # Example: add 5 to each score
display(students.apply(lambda x: x * 2))  # Example: double each score 


Alice      85
Bob        90
Charlie    78
David      92
Eve        88
Name: Scores, dtype: int64

Datatype of students: <class 'pandas.core.series.Series'>



Alice      90
Bob        95
Charlie    83
David      97
Eve        93
Name: Scores, dtype: int64

Alice      170
Bob        180
Charlie    156
David      184
Eve        176
Name: Scores, dtype: int64

In [15]:
data1 = pd.DataFrame({'EmployeeName': ['Callen Dunkley', 'Sarah Rayner', 'Jeanette Sloan', 'Kaycee Acosta', 'Henri Conroy', 'Emma Peralta', 'Martin Butt', 'Alex Jensen', 'Kim Howarth', 'Jane Burnett'],
                    'Department': ['Accounting', 'Engineering', 'Engineering', 'HR', 'HR', 'HR', 'Data Science', 'Data Science', 'Accounting', 'Data Science'],
                    'HireDate': [2010, 2018, 2012, 2014, 2014, 2018, 2020, 2018, 2020, 2012],
                    'Sex': ['M', 'F', 'F', 'F', 'M', 'F', 'M', 'M', 'M', 'F'],
                    'Birthdate': ['04/09/1982', '14/04/1981', '06/05/1997', '08/01/1986', '10/10/1988', '12/11/1992', '10/04/1991', '16/07/1995', '08/10/1992', '11/10/1979'],
                    'Weight': [78, 80, 66, 67, 90, 57, 115, 87, 95, 57],
                    'Height': [176, 160, 169, 157, 185, 164, 195, 180, 174, 165],
                    'Kids': [2, 1, 0, 1, 1, 0, 2, 0, 3, 1]
                    })
display(data1)


Unnamed: 0,EmployeeName,Department,HireDate,Sex,Birthdate,Weight,Height,Kids
0,Callen Dunkley,Accounting,2010,M,04/09/1982,78,176,2
1,Sarah Rayner,Engineering,2018,F,14/04/1981,80,160,1
2,Jeanette Sloan,Engineering,2012,F,06/05/1997,66,169,0
3,Kaycee Acosta,HR,2014,F,08/01/1986,67,157,1
4,Henri Conroy,HR,2014,M,10/10/1988,90,185,1
5,Emma Peralta,HR,2018,F,12/11/1992,57,164,0
6,Martin Butt,Data Science,2020,M,10/04/1991,115,195,2
7,Alex Jensen,Data Science,2018,M,16/07/1995,87,180,0
8,Kim Howarth,Accounting,2020,M,08/10/1992,95,174,3
9,Jane Burnett,Data Science,2012,F,11/10/1979,57,165,1


In [16]:
def first_name(name):
    return name.split()[0]  # Split the full name on whitespace and return the first word (the first name).


data1["FirstName"] = data1["EmployeeName"].apply(first_name)
data1["LastName"] = data1["EmployeeName"].apply(lambda x: x.split()[1]) 
# this will create a new column "LastName" by applying a lambda function that splits the "EmployeeName" and takes the second part (the last name).     


display(data1)


Unnamed: 0,EmployeeName,Department,HireDate,Sex,Birthdate,Weight,Height,Kids,FirstName,LastName
0,Callen Dunkley,Accounting,2010,M,04/09/1982,78,176,2,Callen,Dunkley
1,Sarah Rayner,Engineering,2018,F,14/04/1981,80,160,1,Sarah,Rayner
2,Jeanette Sloan,Engineering,2012,F,06/05/1997,66,169,0,Jeanette,Sloan
3,Kaycee Acosta,HR,2014,F,08/01/1986,67,157,1,Kaycee,Acosta
4,Henri Conroy,HR,2014,M,10/10/1988,90,185,1,Henri,Conroy
5,Emma Peralta,HR,2018,F,12/11/1992,57,164,0,Emma,Peralta
6,Martin Butt,Data Science,2020,M,10/04/1991,115,195,2,Martin,Butt
7,Alex Jensen,Data Science,2018,M,16/07/1995,87,180,0,Alex,Jensen
8,Kim Howarth,Accounting,2020,M,08/10/1992,95,174,3,Kim,Howarth
9,Jane Burnett,Data Science,2012,F,11/10/1979,57,165,1,Jane,Burnett


### map()

`Series.map()` transforms each element of a Series using a function, a dictionary, or another Series, and returns a new Series with the transformed values. When you pass a dictionary, values that have no matching key become `NaN`. `DataFrame.map()` applies a function to every element of a DataFrame; it was added in pandas 2.1 as the replacement for `applymap()`. The syntax is:

```python
series.map(func)  # Map a function (or dictionary) over each element in the Series, returning a new Series.
dataframe.map(func)  # Map a function over each element in the DataFrame, returning a new DataFrame.
```
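
A short sketch of both styles of mapping follows. The `departments` Series and the `codes` dictionary are hypothetical names chosen for this example.

```python
import pandas as pd

# Hypothetical department names, used only for illustration
departments = pd.Series(["HR", "Accounting", "HR", "Legal"])

# Dictionary mapping: values without a matching key become NaN
codes = {"HR": 1, "Accounting": 2}
department_codes = departments.map(codes)

# Function mapping: the function is called once per element
lowercase = departments.map(str.lower)

print(department_codes)
print(lowercase)
```

Because "Legal" has no entry in `codes`, its mapped value is `NaN`, which also turns the result into a float Series.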



# Concatenation and merging 
> Concatenation and merging are two common operations in pandas that allow you to combine DataFrames or Series in different ways. 





### concat()
The `concat()` function in pandas is used to concatenate two or more DataFrames along a specified axis (rows or columns). The syntax for the `concat()` function is as follows:

```python
pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False)
```     

Where: 
- `objs`: A list or dictionary of pandas objects (DataFrames or Series) to concatenate.
- `axis`: The axis along which to concatenate. Use `0` for rows and `1` for columns. The default is `0`.
- `join`: How to handle the labels on the other axis. Use `outer` for a union of the labels and `inner` for an intersection. Unlike `merge()`, `concat()` does not support `left` or `right`. The default is `outer`.
- `ignore_index`: If `True`, the resulting DataFrame will have a new integer index. If `False`, the original indexes will be retained. The default is `False`.
- `keys`: If multiple levels passed, should contain tuples. Construct hierarchical index using the passed keys as the outermost level.
- `levels`: Specific levels (unique values) to use for constructing a MultiIndex. Otherwise they will be inferred from the keys.
- `names`: Names for the levels in the resulting hierarchical index.  
- `verify_integrity`: Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation. The default is `False`.
- `sort`: Sort the non-concatenation axis if it is not already aligned. This has no effect when `join='inner'`, which already preserves the order of the non-concatenation axis. The default is `False`.
  


In [17]:
import pandas as pd
import IPython.display as display  

# Create two sample DataFrames
temp_data = pd.DataFrame({'Date': ['12-02-2023', '13-02-2023', '14-02-2023', '15-02-2023', '16-02-2023'],
                    'TempMax': [24.3, 26.9, 23.4, 15.5, 16.1 ] })

rainfall_data = pd.DataFrame({'Date': ['12-02-2023', '13-02-2023', '14-02-2023', '15-02-2023', '16-02-2023'],
                    'Rainfall': [0, 3.6, 3.6, 39.8, 2.8 ] })


display.display(temp_data)
print("\n=============================================================================================\n")
display.display(rainfall_data)

print("\n=============================================================================================")      
print("=============================================================================================\n")


final_concat = pd.concat([temp_data, rainfall_data], axis=0)
print("output of pd.concat([temp_data, rainfall_data], axis=0):\n")
display.display(final_concat)
# this will concatenate the temp_data and rainfall_data DataFrames vertically (row-wise) because we specified axis=0. The resulting DataFrame will have all the rows from both DataFrames, and the columns will be aligned based on their names. Since both DataFrames have a "Date" column, it will be included in the final concatenated DataFrame. However, since the "TempMax" and "Rainfall" columns are not present in both DataFrames, they will contain NaN values for the rows where they are not available.   

print("\n=============================================================================================\n")      
final_concat = pd.concat([temp_data, rainfall_data], axis=1)
print("output of pd.concat([temp_data, rainfall_data], axis=1):\n")
display.display(final_concat)  
# This concatenates temp_data and rainfall_data horizontally (column-wise) because we specified axis=1. The result contains all the columns from both DataFrames, with rows aligned on their index. Both DataFrames have a "Date" column, so it appears twice in the result. Because the two DataFrames share the same index (0 to 4), every row is matched and no NaN values are introduced.



Unnamed: 0,Date,TempMax
0,12-02-2023,24.3
1,13-02-2023,26.9
2,14-02-2023,23.4
3,15-02-2023,15.5
4,16-02-2023,16.1






Unnamed: 0,Date,Rainfall
0,12-02-2023,0.0
1,13-02-2023,3.6
2,14-02-2023,3.6
3,15-02-2023,39.8
4,16-02-2023,2.8




output of pd.concat([temp_data, rainfall_data], axis=0):



Unnamed: 0,Date,TempMax,Rainfall
0,12-02-2023,24.3,
1,13-02-2023,26.9,
2,14-02-2023,23.4,
3,15-02-2023,15.5,
4,16-02-2023,16.1,
0,12-02-2023,,0.0
1,13-02-2023,,3.6
2,14-02-2023,,3.6
3,15-02-2023,,39.8
4,16-02-2023,,2.8




output of pd.concat([temp_data, rainfall_data], axis=1):



Unnamed: 0,Date,TempMax,Date.1,Rainfall
0,12-02-2023,24.3,12-02-2023,0.0
1,13-02-2023,26.9,13-02-2023,3.6
2,14-02-2023,23.4,14-02-2023,3.6
3,15-02-2023,15.5,15-02-2023,39.8
4,16-02-2023,16.1,16-02-2023,2.8


In [18]:
final_concat_using_key = pd.concat([temp_data, rainfall_data], keys=['a' , 'b'])
print("\n=============================================================================================\n")
print("output of pd.concat([temp_data, rainfall_data], keys=['a' , 'b']):")
display.display(final_concat_using_key)    
# this will concatenate the temp_data and rainfall_data DataFrames and create a hierarchical index using the keys 'a' and 'b'. The resulting DataFrame will have a multi-level index where the first level is the key ('a' for temp_data and 'b' for rainfall_data) and the second level is the original index of each DataFrame. The columns will be aligned based on their names, and since both DataFrames have a "Date" column, it will be included in the final concatenated DataFrame. However, since the "TempMax" and "Rainfall" columns are not present in both DataFrames, they will contain NaN values for the rows where they are not available. 


print("\n=============================================================================================\n")
print("output of final_concat_using_key.loc['a']:")     
display.display(final_concat_using_key.loc['a'])
# This returns the part of final_concat_using_key stored under the key 'a', i.e. the rows that came from temp_data, with their original index. The columns are "Date", "TempMax", and "Rainfall"; "Rainfall" contains only NaN values because temp_data has no such column.






output of pd.concat([temp_data, rainfall_data], keys=['a' , 'b']):


Unnamed: 0,Unnamed: 1,Date,TempMax,Rainfall
a,0,12-02-2023,24.3,
a,1,13-02-2023,26.9,
a,2,14-02-2023,23.4,
a,3,15-02-2023,15.5,
a,4,16-02-2023,16.1,
b,0,12-02-2023,,0.0
b,1,13-02-2023,,3.6
b,2,14-02-2023,,3.6
b,3,15-02-2023,,39.8
b,4,16-02-2023,,2.8




output of final_concat_using_key.loc['a']:


Unnamed: 0,Date,TempMax,Rainfall
0,12-02-2023,24.3,
1,13-02-2023,26.9,
2,14-02-2023,23.4,
3,15-02-2023,15.5,
4,16-02-2023,16.1,


In [19]:
# Creating the first DataFrame
data1 = {'Name': ['Alice', 'Bob', 'Charlie'],
         'Age': [25, 30, 35],
         'Score': [85, 90, 88]}
df1 = pd.DataFrame(data1)

# Creating the second DataFrame
data2 = {'Name': ['David', 'Eve', 'Charlie'],
         'Age': [27, 32, 35],
         'Score': [82, 88, 88],
         "extra":[100,100,100]}
df2 = pd.DataFrame(data2)

print("DataFrame 1:")
display.display(df1)
print("\nDataFrame 2:")
display.display(df2)


print("\n=============================================================================================\n")
print("output of pd.concat([df1, df2], axis=1):")
final_concat = pd.concat([df1, df2], axis=1)
display.display(final_concat)


DataFrame 1:


Unnamed: 0,Name,Age,Score
0,Alice,25,85
1,Bob,30,90
2,Charlie,35,88



DataFrame 2:


Unnamed: 0,Name,Age,Score,extra
0,David,27,82,100
1,Eve,32,88,100
2,Charlie,35,88,100




output of pd.concat([df1, df2], axis=1):


Unnamed: 0,Name,Age,Score,Name.1,Age.1,Score.1,extra
0,Alice,25,85,David,27,82,100
1,Bob,30,90,Eve,32,88,100
2,Charlie,35,88,Charlie,35,88,100


In [20]:
# join='inner' keeps only the columns that are present in every DataFrame.

final_concat_using_join = pd.concat([df1, df2], join='inner')
print("\n=============================================================================================")
print("output of pd.concat([df1, df2], join='inner'):")
display.display(final_concat_using_join)
# This concatenates df1 and df2 row-wise and keeps only the columns common to both ("Name", "Age", and "Score"). All rows from both DataFrames are retained, so "Charlie" appears twice and the index values 0 to 2 repeat. The "extra" column from df2 is dropped because df1 has no such column. Note that concat() never matches rows by value; use merge() for that.




output of pd.concat([df1, df2], join='inner'):


Unnamed: 0,Name,Age,Score
0,Alice,25,85
1,Bob,30,90
2,Charlie,35,88
0,David,27,82
1,Eve,32,88
2,Charlie,35,88
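
The repeated index values in the output above come from keeping each DataFrame's original index. A minimal sketch of how `ignore_index=True` changes that, using two hypothetical frames named `first` and `second`:

```python
import pandas as pd

# Hypothetical frames, used only for illustration
first = pd.DataFrame({"Name": ["Alice", "Bob"], "Score": [85, 90]})
second = pd.DataFrame({"Name": ["David", "Eve"], "Score": [82, 88]})

# Default: each frame keeps its own index, so labels repeat
stacked = pd.concat([first, second])

# ignore_index=True discards the old labels and numbers the rows 0..n-1
renumbered = pd.concat([first, second], ignore_index=True)

print(stacked)
print(renumbered)
```

A fresh index is usually what you want when the original row labels carry no meaning, since duplicate labels make `.loc` lookups return several rows.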


### merge()
The `merge()` function in pandas is used to merge two DataFrames based on a common column or index. It is similar to SQL joins and allows you to combine DataFrames based on shared keys. The syntax for the `merge()` function is as follows:

```python
dataframe.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, validate=None)
```

Where: 
- `right`: The DataFrame (or named Series) to merge with the current DataFrame.
- `how`: The type of merge to perform. Use `inner` for an inner join, `outer` for a full outer join, `left` for a left join, `right` for a right join, and `cross` for a Cartesian product of the rows. The default is `inner`.
- `on`: The column or index level names to join on. These must be found in both DataFrames. If `on` is not specified and the indexes are not used as keys, pandas joins on all columns that the two DataFrames have in common.
- `left_on`: The column or index level names from the left DataFrame to use as keys. Can be a single column name or a list of column names.
- `right_on`: The column or index level names from the right DataFrame to use as keys. Can be a single column name or a list of column names.
- `left_index`: If `True`, use the index from the left DataFrame as the join key. The default is `False`.
- `right_index`: If `True`, use the index from the right DataFrame as the join key. The default is `False`.
- `sort`: If `True`, sort the join keys lexicographically in the result. The default is `False`.
- `validate`: If specified, checks whether the merge is of the given type. For example, `validate='one_to_one'` checks that the merge keys are unique in both DataFrames and raises an error otherwise. The default is `None`, which means no validation is performed.





In [44]:
data1 = {
    "ID" : [10001,20002,30003,40004,50005],
    "Numbers": [10,20,20,40,50],
    "Letters":["A","B","C","D","E"]
}

df1 = pd.DataFrame(data1)  
display.display(df1) 

data2 = {
    "ID" : [10001,20002,30003,60006,70007],
    "Numbers": [10,20,30,40,60],
    "City":["Lucknow","Munnar","Chennai","Delhi","Jaipur"]
}

df2 = pd.DataFrame(data2)
display.display(df2) 

Unnamed: 0,ID,Numbers,Letters
0,10001,10,A
1,20002,20,B
2,30003,20,C
3,40004,40,D
4,50005,50,E


Unnamed: 0,ID,Numbers,City
0,10001,10,Lucknow
1,20002,20,Munnar
2,30003,30,Chennai
3,60006,40,Delhi
4,70007,60,Jaipur


In [57]:
display.display(df1.merge(df2)) 
# This will merge the df1 and df2 DataFrames based on the common columns "ID" and "Numbers". The resulting DataFrame will only include the rows where the values in both columns match in both DataFrames. In this case, the rows with ID 10001 and 20002 will be included in the merged DataFrame, while the other rows will be excluded since they do not have matching values in both columns. The resulting DataFrame will contain the columns "ID", "Numbers", "Letters", and "City" for the matching rows. 

display.display(df1.merge(df2, on=["ID", "Numbers"]))
# this will merge the df1 and df2 DataFrames based on the "ID" and "Numbers" columns. The resulting DataFrame will only include the rows where the values in both columns match in both DataFrames. In this case, the rows with ID 10001 and 20002 will be included in the merged DataFrame, while the other rows will be excluded since they do not have matching values in both columns. The resulting DataFrame will contain the columns "ID", "Numbers", "Letters", and "City" for the matching rows.

display.display(df1.merge(df2, on="ID" , how="inner"))
# this will merge the df1 and df2 DataFrames based on the "ID" column and perform an inner join. The resulting DataFrame will only include the rows where the "ID" values match in both DataFrames. In this case, the matching IDs are 10001, 20002, and 30003. The resulting DataFrame will contain the columns "ID", "Numbers_x", "Letters", "Numbers_y", and "City". The "Numbers_x" column will contain the values from df1, while the "Numbers_y" column will contain the values from df2 for the matching IDs. The rows with IDs 40004 and 50005 from df1, and 60006 and 70007 from df2 will not be included in the final merged DataFrame since they do not have matching IDs in the other DataFrame.




Unnamed: 0,ID,Numbers,Letters,City
0,10001,10,A,Lucknow
1,20002,20,B,Munnar


Unnamed: 0,ID,Numbers,Letters,City
0,10001,10,A,Lucknow
1,20002,20,B,Munnar


Unnamed: 0,ID,Numbers_x,Letters,Numbers_y,City
0,10001,10,A,10,Lucknow
1,20002,20,B,20,Munnar
2,30003,20,C,30,Chennai


In [59]:
display.display(df1.merge(df2, on="ID", how="outer"))
# this will merge the df1 and df2 DataFrames based on the "ID" column and perform an outer join. The resulting DataFrame will include all rows from both DataFrames, and where there are matching "ID" values, the corresponding data from both DataFrames will be combined. For rows with matching IDs (10001, 20002, and 30003), the resulting DataFrame will contain the columns "ID", "Numbers_x", "Letters", "Numbers_y", and "City". For rows that do not have a match in the other DataFrame (IDs 40004 and 50005 from df1, and 60006 and 70007 from df2), the resulting DataFrame will still include those rows, but the columns from the other DataFrame will contain NaN values for those unmatched rows.




Unnamed: 0,ID,Numbers_x,Letters,Numbers_y,City
0,10001,10.0,A,10.0,Lucknow
1,20002,20.0,B,20.0,Munnar
2,30003,20.0,C,30.0,Chennai
3,40004,40.0,D,,
4,50005,50.0,E,,
5,60006,,,40.0,Delhi
6,70007,,,60.0,Jaipur
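
The cells above cover inner and outer joins. As a complementary sketch, the example below shows a left join and the `indicator` option of `merge()`, which records where each row came from. The `orders` and `cities` tables are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical tables, used only for illustration
orders = pd.DataFrame({"ID": [1, 2, 3], "Item": ["pen", "ink", "pad"]})
cities = pd.DataFrame({"ID": [1, 3, 4], "City": ["Delhi", "Jaipur", "Munnar"]})

# Left join: keep every row of `orders`; unmatched rows get NaN in "City"
left_joined = orders.merge(cities, on="ID", how="left")

# indicator=True adds a "_merge" column: "left_only", "right_only" or "both"
outer_joined = orders.merge(cities, on="ID", how="outer", indicator=True)

print(left_joined)
print(outer_joined)
```

The `_merge` column is a convenient way to find rows that exist in only one of the two tables, for example by filtering on `outer_joined["_merge"] != "both"`.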


# Comparison

#### comparison operators

> In pandas, you can use comparison operators to compare values in a Series or DataFrame. The comparison operators include:
- `==`: Equal to     
- `!=`: Not equal to
- `<`: Less than
- `>`: Greater than
- `<=`: Less than or equal to
- `>=`: Greater than or equal to

When you use these operators on a Series or DataFrame, they will return a new Series or DataFrame of boolean values (True or False) indicating the result of the comparison for each element. For example, if you have a DataFrame `df` and you want to compare the values in the 'age' column to 30, you can do it like this:

```python
df['age'] > 30
```

This will return a Series of boolean values where each value is `True` if the corresponding value in the 'age' column is greater than 30, and `False` otherwise. You can also use these comparison operators to filter data based on conditions. For example, to select rows where the age is greater than 30, you can do:

```python
df[df['age'] > 30]
```




### .compare() function 

The `.compare()` function in pandas compares two DataFrames and returns a new DataFrame that shows only the differences between them. For each column that differs, the value from the calling DataFrame appears under `self` and the value from the other DataFrame appears under `other`. Both DataFrames must have identical row and column labels. The syntax for the `.compare()` function is as follows:

```python     
dataframe1.compare(dataframe2, align_axis=1, keep_shape=False, keep_equal=False)
```     

Where: 
- `dataframe1`: The first DataFrame to compare (shown as `self` in the result).
- `dataframe2`: The second DataFrame to compare (shown as `other` in the result).
- `align_axis`: The axis along which the compared values are laid out. With `1` (the default), the `self` and `other` values are placed side by side as columns; with `0`, they are stacked as alternating rows.
- `keep_shape`: If `True`, the resulting DataFrame keeps all rows and columns of the original DataFrames, with NaN values where there are no differences. If `False`, the resulting DataFrame only includes rows and columns where there are differences. The default is `False`.
- `keep_equal`: If `True`, equal values are shown in the result instead of being replaced by NaN. If `False`, equal values are shown as NaN. The default is `False`.



The `.compare()` function is particularly useful for identifying and analyzing differences between two DataFrames, such as when you want to compare the results of two different data processing steps or when you want to track changes in a dataset over time.          




In [26]:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b", "a"],
        "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
    },
    columns=["col1", "col2", "col3"],
)

display.display(df1)

Unnamed: 0,col1,col2,col3
0,a,1.0,1.0
1,a,2.0,2.0
2,b,3.0,3.0
3,b,,4.0
4,a,5.0,5.0


In [30]:
df2 = df1.copy() 
# this will create a new DataFrame df2 that is a copy of df1. Any changes made to df2 will not affect df1, and vice versa. This is useful when you want to work with a DataFrame without modifying the original one.     



display.display(df1.compare(df2))  
# This compares df1 and df2 and returns a DataFrame of their differences. Because df2 is a copy of df1, there are no differences, so the result is an empty DataFrame (no rows and no columns). Missing values in the same position, such as the NaN in "col2", are treated as equal.


In [32]:
df1.loc[0,"col1"] = "shivam"
display.display(df1.compare(df2))  
# This compares df1 and df2 again after changing one value in df1. The result contains only the differing cell: for "col1" at index 0, `self` shows the new value from df1 ("shivam") and `other` shows the original value from df2 ("a"). Rows and columns without differences are omitted.

df1.loc[2,"col3"] = 4.0
display.display(df1.compare(df2))


Unnamed: 0_level_0,col1,col1
Unnamed: 0_level_1,self,other
0,shivam,a


Unnamed: 0_level_0,col1,col1,col3,col3
Unnamed: 0_level_1,self,other,self,other
0,shivam,a,,
2,,,4.0,3.0
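
The outputs above use the default settings, which drop everything that is equal. The following sketch shows how `keep_shape` and `keep_equal` change the result; the `before` and `after` DataFrames are hypothetical snapshots created for this example.

```python
import pandas as pd

# Hypothetical before/after snapshots, used only for illustration
before = pd.DataFrame({"price": [10, 20, 30], "stock": [5, 5, 5]})
after = before.copy()
after.loc[1, "price"] = 25  # a single changed cell

# Default: only the differing rows and columns are returned
diff = before.compare(after)

# keep_shape=True keeps every row and column; equal cells show as NaN
diff_full = before.compare(after, keep_shape=True)

# keep_equal=True additionally shows the equal values instead of NaN
diff_values = before.compare(after, keep_shape=True, keep_equal=True)

print(diff)
print(diff_full)
print(diff_values)
```

The default form is compact and suits a quick check, while `keep_shape=True` is easier to line up against the original tables.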


# Pivot tables 
> A pivot table is a powerful data summarization tool that allows you to aggregate and analyze data in a DataFrame. It is used to reshape and summarize data by grouping it based on one or more columns and applying aggregation functions to the grouped data. In pandas, you can create pivot tables using the `pivot_table()` function. The syntax for the `pivot_table()` function is as follows:

```python
pd.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')
```     

Where:
- `data`: The DataFrame to be used for creating the pivot table.
- `values`: The column(s) to be aggregated. If not specified, all numeric columns will be aggregated.
- `index`: The column(s) to group by on the rows.
- `columns`: The column(s) to group by on the columns.
- `aggfunc`: The aggregation function to apply to the grouped data. The default is 'mean', but you can use other functions such as 'sum', 'count', 'min', 'max', etc.
- `fill_value`: The value to replace missing values with in the resulting pivot table.
- `margins`: If `True`, adds an extra row and an extra column containing the subtotals and the grand total. The default is `False`.
- `dropna`: If `True`, do not include columns whose entries are all NaN. The default is `True`.
- `margins_name`: Name of the row/column that will contain the totals when `margins` is `True`. The default is 'All'.  
- `sort`: Sort the resulting DataFrame by the index. The default is `True`.  
- `observed`: Applies only when a grouping column is categorical. If `True`, only the categories that actually occur in the data are included in the pivot table. If `False`, all categories are included, even those that do not occur in the data. In pandas 2.x the default behaves like `False`, but that default is deprecated and is set to change to `True` in a later version.

Pivot tables are particularly useful for summarizing and analyzing large datasets, allowing you to quickly identify patterns, trends, and relationships in the data. They can be used for various purposes, such as calculating totals, averages, or other aggregate statistics based on different groupings of the data.  



In [36]:
import pandas as pd
import numpy as np
# Create a DataFrame
data = {
    'Date': ['2022-01-01', '2022-01-01', '2022-01-02', '2022-01-02'],
    'Category': ['A', 'B', 'A', 'B'],
    'Value': [10, 20, 30, 40]
}
df = pd.DataFrame(data)
display.display(df)

# Create a pivot table
pivot_table = df.pivot_table(values='Value', index='Date', columns='Category', aggfunc='sum')
display.display(pivot_table)


Unnamed: 0,Date,Category,Value
0,2022-01-01,A,10
1,2022-01-01,B,20
2,2022-01-02,A,30
3,2022-01-02,B,40


Category,A,B
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
2022-01-01,10,20
2022-01-02,30,40


In [37]:
data = {}
np.random.seed(2)
for i in [chr(x) for x in range(65,70)]:
  data['col'+i] = np.random.randint(1,100,10)
data['orderID'] = np.random.choice(['A', 'B', 'C'], 10)
data['product'] = np.random.choice(['Product1', 'Product2', 'Product3'], 10)
data['customer'] = np.random.choice(['Customer1', 'Customer2', 'Customer3', 'Customer4'], 10)
df = pd.DataFrame(data)

display.display(df)

Unnamed: 0,colA,colB,colC,colD,colE,orderID,product,customer
0,41,96,68,69,51,B,Product2,Customer3
1,16,76,5,47,5,C,Product2,Customer1
2,73,86,43,71,91,B,Product2,Customer1
3,23,48,52,96,64,B,Product3,Customer3
4,44,64,39,84,80,A,Product1,Customer3
5,83,32,34,32,50,B,Product3,Customer1
6,76,91,59,67,40,C,Product3,Customer3
7,8,21,68,81,47,A,Product1,Customer1
8,35,38,70,53,9,C,Product1,Customer2
9,50,40,89,77,51,B,Product3,Customer3


In [43]:
pivot_table2 = df.pivot_table(values= "colA" , index="orderID" , columns="product" , aggfunc="sum")
display.display(pivot_table2)

product,Product1,Product2,Product3
orderID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
A,52.0,,
B,,114.0,156.0
C,35.0,16.0,76.0
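
The NaN entries in the output above mark combinations that never occur in the data. A minimal sketch of how `fill_value` and `margins` affect such a table, using a hypothetical `sales` DataFrame:

```python
import pandas as pd

# Hypothetical sales records, used only for illustration
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "product": ["A", "B", "A", "A", "C"],
    "units": [10, 5, 7, 3, 8],
})

summary = sales.pivot_table(
    values="units",
    index="region",
    columns="product",
    aggfunc="sum",
    fill_value=0,   # show 0 instead of NaN for missing combinations
    margins=True,   # add an "All" row and column with totals
)

print(summary)
```

With `aggfunc="sum"`, the "All" row and column hold the row totals, the column totals, and the grand total in the bottom-right cell.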


# Modifying data in pandas
> Modifying data in pandas can be done using various methods, such as assignment, the `apply()` function, or using built-in functions to perform operations on the DataFrame. Here are some common ways to modify data in pandas:      
> - Assignment: You can modify data in a DataFrame by assigning new values to specific cells, rows, or columns. For example, to change the value of a specific cell, you can use `.loc` or `.iloc`:

```python            
df.loc[row_label, column_label] = new_value  # Using .loc for label-based indexing
df.iloc[row_position, column_position] = new_value  # Using .iloc for integer position-based indexing

df['column_name'] = new_values  # Replacing an entire column
df['new_column'] = df['existing_column'] * 200  # Creating a new column based on existing data

```

> - `apply()` function: You can use the `apply()` function to apply a custom function to each element, row, or column of a DataFrame. For example, to apply a function to each element in a column:

```python     
df['column_name'] = df['column_name'].apply(lambda x: x * 2)  
# Example of applying a function to double the values in a column
```
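
The assignment patterns above can be combined with a boolean condition to update only some rows. Here is a small sketch; the `staff` DataFrame and the bonus rule are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Hypothetical data, used only for illustration
staff = pd.DataFrame({"name": ["Ann", "Raj", "Li"], "salary": [3000, 4500, 5200]})

# Overwrite a single cell by row label and column label
staff.loc[0, "salary"] = 3200

# Create a new column from an existing one
staff["bonus"] = staff["salary"] // 10

# Update only the rows that satisfy a condition
staff.loc[staff["salary"] > 4000, "bonus"] = 500

print(staff)
```

Passing the condition and the column name together to `.loc` performs the update in a single step on the original DataFrame, which avoids the ambiguity of chained indexing such as `staff[mask]["bonus"] = 500`.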



## drop() function in pandas
> The `drop()` function in pandas is used to remove specified labels from rows or columns of a DataFrame. You can use it to drop rows or columns based on their labels. The syntax for the `drop()` function is as follows:

```python     
dataframe.drop(labels=None, axis=0, index=None, columns=None, inplace=False)
```     

Where: 
- `labels`: The labels to drop. This can be a single label or a list of labels.
- `axis`: The axis along which to drop the labels. Use `0` to drop rows and `1` to drop columns. The default is `0`.
- `index`: Alternative to `labels` when dropping rows. You can specify the index labels to drop.
- `columns`: Alternative to `labels` when dropping columns. You can specify the column labels to drop.
- `inplace`: If `True`, the operation will be performed in place and the original DataFrame will be modified. If `False`, a new DataFrame with the specified labels dropped will be returned. The default is `False`.    
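
A minimal sketch of the most common ways to call `drop()`, using a hypothetical `inventory` DataFrame:

```python
import pandas as pd

# Hypothetical data, used only for illustration
inventory = pd.DataFrame(
    {"item": ["pen", "ink", "pad"], "price": [10, 25, 40], "note": ["", "", ""]},
    index=["a", "b", "c"],
)

without_row = inventory.drop(index="b")         # drop one row by its label
without_col = inventory.drop(columns=["note"])  # drop one column by its name
same_as_col = inventory.drop("note", axis=1)    # equivalent, using labels + axis

print(without_row)
print(without_col)
print(inventory.shape)  # the original is unchanged because inplace=False
```

The `index=` and `columns=` keywords tend to be easier to read than `labels` combined with `axis`, because the call itself states what is being removed.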



## sample() function in pandas
> The `sample()` function in pandas is used to generate a random sample of rows from a DataFrame. It allows you to randomly select a specified number of rows or a fraction of the total rows. The syntax for the `sample()` function is as follows:

```python
dataframe.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)
```

Where:
- `n`: The number of rows to return in the sample. It cannot be combined with `frac`. If neither `n` nor `frac` is given, one row is returned.
- `frac`: The fraction of rows to return in the sample. For example, `frac=0.1` will return 10% of the rows. It cannot be combined with `n`.
- `replace`: Whether to allow sampling of the same row more than once. The default is `False`.
- `weights`: A list or array of weights to assign to each row for sampling. The default is `None`, which means equal probability for all rows.
- `random_state`: A seed value to ensure reproducibility of the random sample. The default is `None`.
- `axis`: The axis along which to sample. Use `0` for rows and `1` for columns. The default (`None`) samples rows for a DataFrame.

The `sample()` function is useful for creating random subsets of your data for analysis, testing, or validation purposes. You can specify the number of rows or the fraction of rows you want to sample, and you can also control whether to allow duplicates in the sample.     
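
The options described above can be sketched as follows. The `numbers` DataFrame is a hypothetical example with 100 rows.

```python
import pandas as pd

# Hypothetical data, used only for illustration
numbers = pd.DataFrame({"value": range(100)})

# Exactly 3 rows; random_state makes the selection repeatable
three_rows = numbers.sample(n=3, random_state=42)

# 10% of the rows
ten_percent = numbers.sample(frac=0.1, random_state=42)

# With replace=True the sample may be larger than the DataFrame
with_repeats = numbers.sample(n=150, replace=True, random_state=42)

print(len(three_rows), len(ten_percent), len(with_repeats))
```

Calling `sample()` again with the same `random_state` returns the same rows, which helps when results need to be reproducible.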


# Aggregation functions in pandas
> Aggregation functions in pandas are used to perform operations on groups of data to summarize or aggregate the data. These functions can be applied to DataFrames or Series to compute various statistics, such as sum, mean, count, etc. Some common aggregation functions in pandas include:
- `sum()`: Computes the sum of values.    
- `mean()`: Computes the mean (average) of values.
- `count()`: Counts the number of non-null values.
- `min()`: Computes the minimum value.
- `max()`: Computes the maximum value.
- `std()`: Computes the standard deviation of values.
- `var()`: Computes the variance of values.
- `median()`: Computes the median of values.
- `mode()`: Returns the mode of values (the most frequent value). Unlike the other functions, it can return more than one value when several values are tied.

> Using the `axis` parameter in aggregation functions:
> You can use the `axis` parameter in aggregation functions to control the direction of the operation. For example, `dataframe.sum(axis=0)` computes the sum of each column, and `dataframe.sum(axis=1)` computes the sum of each row.
> - `axis=0`: Aggregate down the rows, giving one result per column (i.e., the sum of each column). This is the default.
> - `axis=1`: Aggregate across the columns, giving one result per row (i.e., the sum of each row).
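
The difference between the two axes can be sketched on a tiny DataFrame. The name `df_axis` and its values are assumptions made for this example:

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df_axis = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

col_sums = df_axis.sum(axis=0)  # one value per column
row_sums = df_axis.sum(axis=1)  # one value per row

print(col_sums.to_dict())  # {'a': 6, 'b': 60}
print(row_sums.tolist())   # [11, 22, 33]

# agg() computes several aggregations at once, here for each column.
summary = df_axis.agg(["min", "max", "mean"])
print(summary)
```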



In [22]:



random_state = np.random.RandomState(42) 
# The np.random.RandomState(42) creates a random number generator with a fixed seed (42 in this case). 
# This allows you to generate the same sequence of random numbers every time you run the code, which can be useful for reproducibility.  

random_numbers = random_state.rand(5)
print(type(random_numbers)) 
# this will return <class 'numpy.ndarray'>, which indicates that random_numbers is a NumPy array.
print(random_numbers)       
# this will print the array of 5 random numbers generated by the random_state object. 


random_series = pd.Series(random_state.rand(5))
print(type(random_series))
print(random_series) 


print("\n=============================================================================================\n")

print("Standard deviation: "+str(random_series.std())) 
# this will calculate and return the standard deviation of the values in the random_series pandas Series. 
print("Mean: "+str(random_series.mean())) 
# this will calculate and return the mean (average) of the values in the random_series pandas Series. 
print("Sum: "+str(random_series.sum())) 
# this will calculate and return the sum of the values in the random_series pandas Series. 




<class 'numpy.ndarray'>
[0.37454012 0.95071431 0.73199394 0.59865848 0.15601864]
<class 'pandas.core.series.Series'>
0    0.155995
1    0.058084
2    0.866176
3    0.601115
4    0.708073
dtype: float64

Standard deviation: 0.3531248563609777
Mean: 0.4778883735637184
Sum: 2.389441867818592




# groupby() function in pandas
> The `groupby()` function in pandas is used to group data based on one or more columns and perform operations on those groups. It allows you to split the data into groups, apply a function to each group, and then combine the results back into a DataFrame. The syntax for the `groupby()` function is as follows:

```python
dataframe.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, observed=False, dropna=True)
```

Where:
- `by`: What to group by. This can be a single column name, a list of column names, a mapping, or a function. A function is called on each value of the index.
- `axis`: The axis to group along. Use `0` for rows and `1` for columns. The default is `0`. This parameter is deprecated since pandas 2.1.
- `level`: If the DataFrame has a MultiIndex, you can specify the level(s) to group by.
- `as_index`: If `True`, the group labels will be used as the index of the resulting DataFrame. If `False`, the group labels will be kept as regular columns. The default is `True`.
- `sort`: Whether to sort the group keys. The default is `True`.
- `group_keys`: When calling `apply()`, whether to add the group keys to the index of the result so that each piece can be identified. The default is `True`.
- `squeeze`: This parameter was removed in pandas 2.0, so it is not available in the version used here (2.3.3).
- `observed`: This parameter is used when grouping by categorical data. If `True`, only the observed categories will be included in the result. The default is currently `False`.
- `dropna`: Whether to drop rows whose group key is NaN. If `False`, NaN is treated as a group key of its own. The default is `True`.
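
A minimal sketch of how `as_index`, `sort`, and `dropna` change the result. The DataFrame `df_gb` and its columns are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data; one row has a missing group key.
df_gb = pd.DataFrame(
    {"team": ["b", "a", "b", np.nan, "a"], "points": [1, 2, 3, 4, 5]}
)

# Defaults: keys become the index, keys are sorted, rows with a NaN key are dropped.
default_sum = df_gb.groupby("team").sum()

# as_index=False keeps "team" as an ordinary column.
flat_sum = df_gb.groupby("team", as_index=False).sum()

# sort=False keeps the groups in order of first appearance.
unsorted_sum = df_gb.groupby("team", sort=False).sum()

# dropna=False keeps the NaN key as a group of its own.
with_nan = df_gb.groupby("team", dropna=False).sum()

print(default_sum)
print(flat_sum)
print(unsorted_sum.index.tolist())
print(len(with_nan))
```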
  





The three stages of the `groupby()` function in pandas:
1. Splitting: the DataFrame is split into multiple smaller DataFrames based on the values of the key(s).
2. Applying: a function is applied to each of those DataFrames.
3. Combining: the results of the applying stage are combined into a single DataFrame.

![groupbyDescription.png](attachment:groupbyDescription.png)
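
The three stages can also be written out by hand. This is only a sketch to make the steps visible, not how pandas does it internally; the names `df_sac`, `pieces`, `sums`, and `manual` are chosen for this example:

```python
import pandas as pd

df_sac = pd.DataFrame({"key": ["A", "B", "C"] * 5, "data": range(15)})

# 1. Splitting: iterating over a groupby yields (key, sub-DataFrame) pairs.
pieces = {key: group for key, group in df_sac.groupby("key")}

# 2. Applying: compute one value for each piece.
sums = {key: group["data"].sum() for key, group in pieces.items()}

# 3. Combining: gather the per-group results into one object.
manual = pd.Series(sums, name="data").rename_axis("key")

# groupby performs all three stages in a single call.
builtin = df_sac.groupby("key")["data"].sum()

print(manual)
print(builtin)
```

Both printed Series should hold the same per-key sums, which shows that the one-line `groupby` call is shorthand for the manual steps.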


In [23]:

print(range(15))
print(type(range(15)))
# The range(15) function generates a sequence of numbers from 0 to 14.
print(list(range(15)))    

df = pd.DataFrame({"key": ['A', 'B', 'C'] * 5, "data" : range(15)})
df


range(0, 15)
<class 'range'>
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]


|   | key | data |
|---|-----|------|
| 0 | A | 0 |
| 1 | B | 1 |
| 2 | C | 2 |
| 3 | A | 3 |
| 4 | B | 4 |
| 5 | C | 5 |
| 6 | A | 6 |
| 7 | B | 7 |
| 8 | C | 8 |
| 9 | A | 9 |


In [24]:
df.groupby("key").sum()   
# or df.groupby(by="key").sum()

# this will group the DataFrame by the "key" column and then calculate the sum of the "data" column for each group. 
# The result will be a new DataFrame with the unique values from the "key" column as the index and the corresponding sums of the "data" column as the values.  


| key | data |
|-----|------|
| A | 30 |
| B | 35 |
| C | 40 |






![plottingDF.png](attachment:plottingDF.png)


# importing data 
> pandas provides several functions to import data from various file formats, such as CSV, Excel, JSON, SQL databases, and more. Here are some common functions for importing data in pandas:
- `pd.read_csv()`: Used to read data from a CSV file.
- `pd.read_json()`: Used to read data from a JSON file.







```python
pd.read_csv(filepath_or_buffer, sep=',', header='infer', names=None, index_col=None, usecols=None, dtype=None, engine='c', converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True) 
``` 
Where:
- `filepath_or_buffer`: The file path or object to read from. This can be a string representing the file path, a file-like object, or a URL.
- `sep`: The delimiter to use when parsing the CSV file. The default is a comma (`,`).
- `header`: The row number(s) to use as the column names. The default is `'infer'`: if `names` is not given, the first row is used as the column names (same as `header=0`); if `names` is given, the file is assumed to have no header row (same as `header=None`). You can also specify an integer or a list of integers to indicate the row(s) to use as the header.
- `names`: A list of column names to use. If the file does not contain a header row, you can specify the column names using this parameter. To replace an existing header row, pass `header=0` together with `names`.
- `index_col`: The column(s) to set as the index of the resulting DataFrame. This can be an integer, a string, or a list of integers or strings.
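
A minimal sketch of `sep`, `header`, `names`, and `index_col`. To keep the example self-contained, `io.StringIO` stands in for a file on disk, and the CSV content is made up for illustration:

```python
import io

import pandas as pd

# Hypothetical CSV text with a header row and ";" as the delimiter.
csv_text = "id;name;score\n1;Ann;9.5\n2;Bob;7.0\n"

# sep sets the delimiter; index_col picks the column used as the index.
with_header = pd.read_csv(io.StringIO(csv_text), sep=";", index_col="id")

# Hypothetical CSV text without a header row.
no_header_text = "1,Ann,9.5\n2,Bob,7.0\n"

# header=None treats every row as data; names supplies the column labels.
no_header = pd.read_csv(
    io.StringIO(no_header_text), header=None, names=["id", "name", "score"]
)

print(with_header)
print(no_header)
```

With a real file, the first argument would simply be the file path or URL instead of the `io.StringIO` object.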



