# Pandas DataFrames
## 1. Creating a Pandas DataFrame
A DataFrame is a two-dimensional, labeled data structure in Pandas.
It is similar to an Excel spreadsheet or SQL table.

In [35]:
import numpy as np
import pandas as pd

In [95]:
from numpy.random import randn
np.random.seed(101)

In [232]:
# Create a 5x4 DataFrame of random values with labeled rows and columns
df_rnd = pd.DataFrame(data=randn(5, 4), index=['A', 'B', 'C', 'D', 'E'], columns=['W', 'X', 'Y', 'Z'])

In [234]:
df_rnd

,W,X,Y,Z
A,0.302665,1.693723,-1.706086,-1.159119
B,-0.134841,0.390528,0.166905,0.184502
C,0.807706,0.07296,0.638787,0.329646
D,-0.497104,-0.75407,-0.943406,0.484752
E,-0.116773,1.901755,0.238127,1.996652


## Creating a DataFrame from a dictionary
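A DataFrame can be created from a dictionary. If the values are lists of equal length, each list becomes a column and a default integer index is generated. If the values are scalars (single values), an explicit index must be supplied.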

In [None]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 60000, 70000, 80000]
}
df = pd.DataFrame(data)


In [72]:
my_dict={"Name":"Sam","Age":40,"Salary":10000}

In [74]:
df=pd.DataFrame(data=my_dict,index=[0])

In [76]:
df

,Name,Age,Salary
0,Sam,40,10000
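
The reason `index=[0]` is needed for a dictionary of scalars can be sketched as follows. This is a minimal illustration; the names `scalar_dict`, `one_row`, and `one_row_from_records` are chosen here and are not part of the notebook.

```python
import pandas as pd

scalar_dict = {"Name": "Sam", "Age": 40, "Salary": 10000}

# All-scalar values give pandas no way to infer the number of rows,
# so the constructor raises ValueError unless an index is supplied.
try:
    pd.DataFrame(scalar_dict)
    needs_index = False
except ValueError:
    needs_index = True

# Option 1: pass an explicit index (one label -> one row).
one_row = pd.DataFrame(scalar_dict, index=[0])

# Option 2: wrap the dict in a list, so it is read as a list of row records.
one_row_from_records = pd.DataFrame([scalar_dict])
```

Both options produce a single-row DataFrame with the same columns.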


In [78]:
my_dict={'Name':['Beryl','Kath','Sam','Mandy'],
        'Age':[25,30,40,23],
        'Salary':[10000,20000,35000,12000],}

In [80]:
df=pd.DataFrame(data=my_dict)

In [82]:
df

,Name,Age,Salary
0,Beryl,25,10000
1,Kath,30,20000
2,Sam,40,35000
3,Mandy,23,12000


In [84]:
# Display the DataFrame

display(df)

,Name,Age,Salary
0,Beryl,25,10000
1,Kath,30,20000
2,Sam,40,35000
3,Mandy,23,12000


## Accessing Data
You can access specific columns, rows, or values in a DataFrame.

In [117]:
# Access a single column
df_rnd['W']

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [121]:
display(df['Name'])


0    Beryl
1     Kath
2      Sam
3    Mandy
Name: Name, dtype: object

In [108]:
type(df_rnd['W'])

pandas.core.series.Series

In [110]:
type(df)

pandas.core.frame.DataFrame

In [123]:
# Access multiple columns
display(df[['Name', 'Salary']])

,Name,Salary
0,Beryl,10000
1,Kath,20000
2,Sam,35000
3,Mandy,12000


In [129]:

# Access a specific row by index
display(df.iloc[1])

Name       Kath
Age          30
Salary    20000
Name: 1, dtype: object

## 4. Accessing Columns - Best Practices
When accessing a single column in a DataFrame, you might see both of these notations:

### 1. Using dot notation (not recommended)
```
df.W
```
While this works, it is not the best practice because:
- It can cause confusion with built-in methods of Pandas DataFrames.
- If a column name conflicts with an existing method, unexpected behavior may occur.

### 2. Using bracket notation (recommended)
```
df['W']
```
This is the preferred way because:
- It avoids conflicts with DataFrame methods.
- It is more explicit and readable.

For multiple columns, always use double brackets:
```
df[['W', 'X']]
```
This ensures a DataFrame is returned instead of a Series.
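
The naming conflict described above can be reproduced with a small sketch. The `sales` frame and its column names are hypothetical, chosen so that one column deliberately shares its name with a DataFrame method.

```python
import pandas as pd

# Hypothetical frame: one column is deliberately named 'count',
# which is also the name of a DataFrame method.
sales = pd.DataFrame({"item": ["pen", "ink"], "count": [3, 7]})

dot_result = sales.count          # the bound method DataFrame.count, not the column
bracket_result = sales["count"]   # the column, as a Series

# Dot notation also cannot reach names that are not valid Python identifiers.
sales["unit price"] = [1.5, 4.0]
price = sales["unit price"]       # there is no dot-notation spelling for this name

# Single brackets return a Series, double brackets a DataFrame.
as_series = sales["count"]
as_frame = sales[["count"]]
```

Bracket notation behaves the same way for every column name, which is why it is the safer default.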

In [112]:
df_rnd.W

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

Dot notation can be ambiguous, however: many attributes and methods are already available after `df.`, and a column whose name matches one of them is hidden by it. The safest approach is to use `[]` notation when requesting a column by name.

In [135]:
df_rnd[['W', 'X']]

,W,X
A,2.70685,0.628133
B,0.651118,-0.319318
C,-2.018168,0.740122
D,0.188695,-0.758872
E,0.190794,1.978757


## Selecting Rows

In [141]:
# Label-based indexing: select a row by its index label
df_rnd.loc['A']

W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64

In [147]:
# Position-based indexing: select a row by its integer position
df_rnd.iloc[0]

W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64

In [149]:
# Select a single value by row label and column label
df_rnd.loc['B','Y']

-0.8480769834036315

In [153]:
df_rnd.loc[['A','B'],['W','Y']]

,W,Y
A,2.70685,0.907969
B,0.651118,-0.848077
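
`.loc` and `.iloc` also differ in how they treat slices. The sketch below uses a small hypothetical frame, `demo`, with the same style of labels as `df_rnd` but fixed integer values so the result is easy to check.

```python
import pandas as pd

# Small hypothetical frame with the same style of row and column labels as df_rnd.
demo = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
    index=["A", "B", "C"],
    columns=["W", "X", "Y", "Z"],
)

# .loc slices by label and includes BOTH endpoints.
by_label = demo.loc["A":"B", "W":"Y"]

# .iloc slices by integer position and excludes the stop position,
# like ordinary Python slicing.
by_position = demo.iloc[0:2, 0:3]

# A single row label and a single column label return one scalar.
single_value = demo.loc["B", "Y"]
```

Here `by_label` and `by_position` select the same two rows and three columns, even though the slice bounds are written differently.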


## Filtering Data

Filtering is used to select specific rows based on conditions.

In [157]:
# Select rows where Age is greater than 30
filtered_df = df[df['Age'] > 30]
display(filtered_df)

,Name,Age,Salary
2,Sam,40,35000


In [159]:
df_rnd>0

,W,X,Y,Z
A,True,True,True,True
B,True,False,False,True
C,False,True,True,False
D,True,False,False,True
E,True,True,True,True


In [161]:
booldf=df_rnd>0


In [163]:
booldf

,W,X,Y,Z
A,True,True,True,True
B,True,False,False,True
C,False,True,True,False
D,True,False,False,True
E,True,True,True,True


In [167]:
df_rnd[booldf]

,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,,,0.605965
C,,0.740122,0.528813,
D,0.188695,,,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [169]:
df_rnd['W']>0

A     True
B     True
C    False
D     True
E     True
Name: W, dtype: bool

In [173]:
df_rnd

,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


Since we're passing a Boolean Series (`df_rnd['W'] > 0`), Pandas keeps only the rows where the condition evaluates to `True`. The resulting DataFrame excludes rows where `W` has negative or zero values.

In [179]:
# Filtering rows where column 'W' > 0
df_rnd[df_rnd['W']>0]

,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [176]:
# Filtering rows where column 'Z' < 0
df_rnd[df_rnd['Z']<0]

,W,X,Y,Z
C,-2.018168,0.740122,0.528813,-0.589001


`df_rnd['W'] > 0` creates a Boolean Series, which returns True for rows where column `'W'` is greater than 0 and False otherwise.

`df_rnd[df_rnd['W'] > 0]` filters the DataFrame to include only rows where `'W' > 0`.

`['Y']` selects only the column `'Y'` from the filtered DataFrame, returning a Pandas Series.

In [181]:
df_rnd[df_rnd['W']>0]['Y']

A    0.907969
B   -0.848077
D   -0.933237
E    2.605967
Name: Y, dtype: float64
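
The same filter-then-select result can be obtained in a single `.loc` call. The sketch below is an illustration on a small hypothetical frame, `demo`, not a replacement for the cell above.

```python
import pandas as pd

demo = pd.DataFrame(
    {"W": [2.0, -1.0, 0.5], "Y": [10.0, 20.0, 30.0]},
    index=["A", "B", "C"],
)

# Two steps: filter rows, then pick a column from the filtered result.
chained = demo[demo["W"] > 0]["Y"]

# One step: row mask and column label in a single .loc call.
single_step = demo.loc[demo["W"] > 0, "Y"]

# .loc is also the form to use when assigning to the selection,
# because chained indexing may operate on a temporary copy.
demo.loc[demo["W"] > 0, "Y"] = 0.0
```

For reading data the two forms give the same Series; for writing, the single `.loc` call is the reliable one.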

## Selecting Multiple Columns
The filtering step `df_rnd[df_rnd['W'] > 0]` remains the same (only rows where `'W' > 0` are kept).

`[['Y', 'X']]` selects both columns `'Y'` and `'X'`, returning a DataFrame instead of a Series.



In [185]:
df_rnd[df_rnd['W']>0][['Y','X']]

,Y,X
A,0.907969,0.628133
B,-0.848077,-0.319318
D,-0.933237,-0.758872
E,2.605967,1.978757


In [197]:
df_rnd[(df_rnd['W']>0) & (df_rnd['Y'] > 1)]

,W,X,Y,Z
E,0.190794,1.978757,2.605967,0.683509
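
Conditions are combined with `&` (and), `|` (or), and `~` (not), and each comparison must be wrapped in parentheses because these operators bind more tightly than `>`. A minimal sketch on a hypothetical frame, `demo`:

```python
import pandas as pd

demo = pd.DataFrame(
    {"W": [1.0, -2.0, 3.0, -4.0], "Y": [5.0, 6.0, -7.0, -8.0]},
    index=["A", "B", "C", "D"],
)

both = demo[(demo["W"] > 0) & (demo["Y"] > 0)]      # AND: row A
either = demo[(demo["W"] > 0) | (demo["Y"] > 0)]    # OR: rows A, B, C
not_w = demo[~(demo["W"] > 0)]                      # NOT: rows B, D

# Python's `and` needs one True/False value, but a comparison on a
# Series yields one value per row, so pandas raises ValueError.
try:
    demo[(demo["W"] > 0) and (demo["Y"] > 0)]
    and_works = True
except ValueError:
    and_works = False
```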


## Handling Missing Values
Missing values in a DataFrame can be filled or removed.

In [191]:
# Creating a DataFrame with missing values
data_with_nan = {
    'Name': ['Alice', 'Bob', None, 'David'],
    'Age': [25, None, 35, 40],
    'Salary': [50000, 60000, 70000, None]
}
df_nan = pd.DataFrame(data_with_nan)
display(df_nan)

,Name,Age,Salary
0,Alice,25.0,50000.0
1,Bob,,60000.0
2,,35.0,70000.0
3,David,40.0,


## Filling Missing Values

In [194]:
# Fill each column with its own default: a placeholder name, the mean age, the median salary
df_nan_filled = df_nan.fillna({'Name': 'Unknown', 'Age': df_nan['Age'].mean(), 'Salary': df_nan['Salary'].median()})
display(df_nan_filled)

,Name,Age,Salary
0,Alice,25.0,50000.0
1,Bob,33.333333,60000.0
2,Unknown,35.0,70000.0
3,David,40.0,60000.0
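
Removing missing values, the other option mentioned above, can be sketched with `dropna`. The frame `people` below repeats the same data under a hypothetical name so the example is self-contained.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", None, "David"],
    "Age": [25, None, 35, 40],
    "Salary": [50000, 60000, 70000, None],
})

missing_per_column = people.isna().sum()      # count of missing values per column

complete_rows = people.dropna()               # keep rows with no missing value
has_name = people.dropna(subset=["Name"])     # only require 'Name' to be present
at_least_two = people.dropna(thresh=2)        # keep rows with >= 2 non-missing values
```

Each call returns a new DataFrame and leaves `people` unchanged. With this data only the first row is complete, so `complete_rows` has a single row.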


## Resetting and Setting the Index

In [199]:
# Reset to default 0,1...n index
df_rnd.reset_index()

,index,W,X,Y,Z
0,A,2.70685,0.628133,0.907969,0.503826
1,B,0.651118,-0.319318,-0.848077,0.605965
2,C,-2.018168,0.740122,0.528813,-0.589001
3,D,0.188695,-0.758872,-0.933237,0.955057
4,E,0.190794,1.978757,2.605967,0.683509


In [201]:
df_rnd

,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509
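
The unchanged `df_rnd` above reflects the fact that `reset_index()` returns a new DataFrame rather than modifying the original. A small sketch on a hypothetical frame, `demo`, also shows the `drop=True` variant:

```python
import pandas as pd

demo = pd.DataFrame({"W": [1.0, 2.0]}, index=["A", "B"])

moved = demo.reset_index()              # old index becomes a column named 'index'
dropped = demo.reset_index(drop=True)   # old index is discarded

# The original is unchanged because reset_index returns a new DataFrame.
original_index = list(demo.index)
```

To keep the result, assign it to a variable, for example `demo = demo.reset_index()`.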


In [238]:
newind = 'CA NY WY OR CO'.split()

In [240]:
newind

['CA', 'NY', 'WY', 'OR', 'CO']

In [242]:
df_rnd['States']=newind

In [244]:
df_rnd

,W,X,Y,Z,States
A,0.302665,1.693723,-1.706086,-1.159119,CA
B,-0.134841,0.390528,0.166905,0.184502,NY
C,0.807706,0.07296,0.638787,0.329646,WY
D,-0.497104,-0.75407,-0.943406,0.484752,OR
E,-0.116773,1.901755,0.238127,1.996652,CO


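`set_index('States')` makes the column `'States'` the index of the DataFrame, replacing the existing index labels (`A` to `E`). Like `reset_index()`, it returns a new DataFrame unless the result is assigned back or `inplace=True` is passed.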

In [215]:
df_rnd.set_index('States')

States,W,X,Y,Z
CA,2.70685,0.628133,0.907969,0.503826
NY,0.651118,-0.319318,-0.848077,0.605965
WY,-2.018168,0.740122,0.528813,-0.589001
OR,0.188695,-0.758872,-0.933237,0.955057
CO,0.190794,1.978757,2.605967,0.683509


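In [220]:
# Option 1: assign the returned DataFrame back to the same name
df_rnd=df_rnd.set_index('States')

In [246]:
# Option 2: modify the DataFrame in place.
# Run only one of the two options: once 'States' is the index it is no longer a column,
# so a second set_index('States') call raises KeyError.
df_rnd.set_index('States',inplace=True)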

In [248]:
df_rnd

States,W,X,Y,Z
CA,0.302665,1.693723,-1.706086,-1.159119
NY,-0.134841,0.390528,0.166905,0.184502
WY,0.807706,0.07296,0.638787,0.329646
OR,-0.497104,-0.75407,-0.943406,0.484752
CO,-0.116773,1.901755,0.238127,1.996652


## Multi-Index and Index Hierarchy
Why Use MultiIndex?
- `Structured Data Representation`: Organizes data hierarchically, with groups on the outer level and subgroups on the inner levels.
- `Easier Subset Selection`: Gives direct access to a whole group or to a single subgroup.
- `Ideal for Grouped Operations`: Represents the results of group-wise calculations that involve more than one level.


Once a DataFrame with a MultiIndex exists (called `df_multi` here for illustration), you can access data using:
```
df_multi.loc['G1']
df_multi.loc['G1'].loc[2]
```

In [260]:
# Defining Index Levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]

`outside`: A list representing the first level of the index. It consists of two groups (G1 and G2), each appearing three times.

`inside`: A list representing the second level of the index. Each group (G1 and G2) has sub-levels (1, 2, and 3).

`zip(outside, inside)`: Combines corresponding elements from both lists into pairs. The result is:

[('G1', 1), ('G1', 2), ('G1', 3), ('G2', 1), ('G2', 2), ('G2', 3)]


In [262]:
hier_index = list(zip(outside,inside))


`list(zip(...))`: Converts the zipped object into a list of tuples.

`pd.MultiIndex.from_tuples(hier_index)`: Converts the list of tuples into a MultiIndex object, which allows Pandas to use multiple levels of indexing.

In [264]:
hier_index = pd.MultiIndex.from_tuples(hier_index)

In [266]:
hier_index

MultiIndex([('G1', 1),
            ('G1', 2),
            ('G1', 3),
            ('G2', 1),
            ('G2', 2),
            ('G2', 3)],
           )
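
`from_tuples` is one of several constructors. As a sketch of the alternatives, the same index can be built with `pd.MultiIndex.from_arrays` or `pd.MultiIndex.from_product`; the variable names below are chosen for this illustration.

```python
import pandas as pd

outside = ["G1", "G1", "G1", "G2", "G2", "G2"]
inside = [1, 2, 3, 1, 2, 3]

from_tuples = pd.MultiIndex.from_tuples(list(zip(outside, inside)))

# from_arrays takes the two parallel lists directly, skipping zip().
from_arrays = pd.MultiIndex.from_arrays([outside, inside])

# from_product builds every combination of the unique values per level.
from_product = pd.MultiIndex.from_product([["G1", "G2"], [1, 2, 3]])
```

`from_product` is the shortest form when every group contains the same set of sub-levels, as it does here.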

In [268]:
df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df

,,A,B
G1,1,-0.993263,0.1968
G1,2,-1.136645,0.000366
G1,3,1.025984,-0.156598
G2,1,-0.031579,0.649826
G2,2,2.154846,-0.610259
G2,3,-0.755325,-0.346419


In [271]:
# Index Hierarchy
df.loc['G1']

,A,B
1,-0.993263,0.1968
2,-1.136645,0.000366
3,1.025984,-0.156598


In [275]:
df.loc['G1'].loc[2]

A   -1.136645
B    0.000366
Name: 2, dtype: float64
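
Chaining two `.loc` calls works, but both levels can also be given at once as a tuple. The sketch below uses a hypothetical frame, `demo`, with fixed integer values so the selections are easy to verify.

```python
import pandas as pd

index = pd.MultiIndex.from_product([["G1", "G2"], [1, 2, 3]])
demo = pd.DataFrame({"A": range(6), "B": range(10, 16)}, index=index)

two_steps = demo.loc["G1"].loc[2]       # outer level, then inner level
one_step = demo.loc[("G1", 2)]          # both levels at once, as a tuple
single_cell = demo.loc[("G1", 2), "B"]  # tuple for the row, label for the column
```

Both row selections return the same values; the tuple form avoids creating the intermediate `demo.loc["G1"]` DataFrame.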

In [277]:
# Return the names assigned to the index levels (None for an unnamed level)
df.index.names

FrozenList([None, None])

## Setting Index Names in MultiIndex DataFrames
In a MultiIndex DataFrame, each level of the index can be given a name, which improves readability when working with hierarchical data.
The cell below assigns meaningful names to the index levels of the DataFrame:

 - The first level (e.g., 'G1', 'G2') is named "Group".

 - The second level (e.g., 1, 2, 3) is named "Num".

This helps when displaying or referencing the index.

In [288]:
df.index.names = ['Group','Num']

In [290]:
df

Group,Num,A,B
G1,1,-0.993263,0.1968
G1,2,-1.136645,0.000366
G1,3,1.025984,-0.156598
G2,1,-0.031579,0.649826
G2,2,2.154846,-0.610259
G2,3,-0.755325,-0.346419
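
Level names can also be supplied when the index is created, instead of being assigned afterwards. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

tuples = [("G1", 1), ("G1", 2), ("G2", 1), ("G2", 2)]

# Name the levels when the index is built...
named = pd.MultiIndex.from_tuples(tuples, names=["Group", "Num"])

# ...or afterwards; set_names returns a new index and leaves the original alone.
unnamed = pd.MultiIndex.from_tuples(tuples)
renamed = unnamed.set_names(["Group", "Num"])

# Named levels can then be referenced by name instead of by position.
groups = named.get_level_values("Group")
```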


## Filtering by One Level of a MultiIndex
`.xs()` stands for "cross-section". It selects the rows (or, with `axis=1`, the columns) that match a given value at one level of a MultiIndex.

`df.xs('G1')` extracts all rows whose first index level (the `Group` level) is `'G1'`.

The result is a DataFrame containing all rows from `'G1'`, indexed by the remaining second level (`Num`).

To match a value on an inner level instead, pass the `level` argument, as in `df.xs(1, level='Num')`.

### When to Use .xs()?
Use `.xs()` when you want to filter by one level of a MultiIndex, especially an inner level, without manually filtering or slicing the DataFrame.

In [292]:
df.xs('G1')

Num,A,B
1,-0.993263,0.1968
2,-1.136645,0.000366
3,1.025984,-0.156598


In [299]:
df.xs(1, level='Num')

Group,A,B
G1,-0.993263,0.1968
G2,-0.031579,0.649826
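
By default `.xs()` drops the level that was matched, as the outputs above show. The sketch below, on a hypothetical frame `demo`, compares that behaviour with `drop_level=False`, which keeps every index level.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["G1", "G2"], [1, 2, 3]], names=["Group", "Num"]
)
demo = pd.DataFrame({"A": range(6)}, index=index)

outer = demo.xs("G1")                                   # same rows as demo.loc["G1"]
inner = demo.xs(1, level="Num")                         # inner level; 'Num' is dropped
inner_kept = demo.xs(1, level="Num", drop_level=False)  # keep both index levels
```

Keeping the level is useful when the filtered rows will later be combined with other rows from the same DataFrame.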


## Reading Data from a CSV File
Pandas can read data from various file formats, including CSV.
The following code reads a CSV file into a DataFrame.
Uncomment the line below and replace 'your_file.csv' with the actual file path.

In [None]:
# df = pd.read_csv('your_file.csv')

## Grouping and Aggregation
Pandas allows grouping of data to perform aggregate functions.

In [None]:
# Average each column within every value of the 'Group' index level
grouped = df.groupby(level='Group').mean()
display(grouped)

## Exporting Data
You can export a DataFrame to various formats such as CSV or Excel.

In [None]:
# Export to CSV (index=False leaves the index labels out of the file)
df.to_csv('output.csv', index=False)

# Export to Excel (requires an Excel writer engine such as openpyxl)
df.to_excel('output.xlsx', index=False)

## Summary Statistics
You can quickly get insights into the data using summary functions.

In [None]:
display(df.describe())
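
The export, import, and summary steps can be tried without writing a file by keeping the CSV text in memory. This is a sketch under that assumption; the `staff` frame and the use of `io.StringIO` in place of a real file path are illustrative choices.

```python
import io
import pandas as pd

staff = pd.DataFrame({
    "Name": ["Beryl", "Kath", "Sam", "Mandy"],
    "Age": [25, 30, 40, 23],
    "Salary": [10000, 20000, 35000, 12000],
})

# With no path argument, to_csv returns the CSV text as a string.
csv_text = staff.to_csv(index=False)

# read_csv accepts any file-like object, so StringIO stands in for a file.
restored = pd.read_csv(io.StringIO(csv_text))

# describe() reports count, mean, std, min, quartiles, and max per numeric column.
summary = restored.describe()
```

Because `index=False` was used on export, the default integer index is regenerated on import and the round trip returns the same columns and values.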