# Let's Explore `Pandas` - An Awesome Python Library


**Pandas** is a powerful Python library primarily used for data manipulation and analysis. It's built on top of **NumPy** and provides two main data structures: **Series** and **DataFrame**, which are designed to handle structured data intuitively.

`Official Doc` - https://pandas.pydata.org/docs/user_guide/10min.html

`W3Schools`    - https://www.w3schools.com/python/pandas/pandas_intro.asp

Let's dive into the details of Pandas:

### 1. **Installing Pandas**
First, you need to install Pandas (if you haven't already):
```bash
pip install pandas
```

### 2. **Importing Pandas**
You typically import Pandas using the alias `pd`:
```python
import pandas as pd
```

### 3. **Pandas Data Structures**

#### **Series**
- A **Series** is a one-dimensional labeled array capable of holding any data type (integer, float, string, etc.). It’s similar to a column in a spreadsheet.
- Each value in a Series is associated with an index, which makes it easy to access values.

##### Example:
```python
import pandas as pd

# Creating a Series from a list
s = pd.Series([1, 3, 5, 7])
print(s)

# Creating a Series with custom index
s_custom = pd.Series([1, 3, 5, 7], index=['a', 'b', 'c', 'd'])
print(s_custom)
```
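
Because every value has an index label, you can look values up by label or by position. The snippet below is a small sketch that rebuilds `s_custom` from the example above; the variable names `by_label`, `by_position`, and `subset` are only illustrative.

```python
import pandas as pd

# Same data as s_custom in the example above
s_custom = pd.Series([1, 3, 5, 7], index=['a', 'b', 'c', 'd'])

by_label = s_custom['b']         # look up by index label
by_position = s_custom.iloc[1]   # look up by integer position (second element)
subset = s_custom[['a', 'd']]    # a list of labels returns a smaller Series

print(by_label, by_position)
print(subset)
```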

#### **DataFrame**
- A **DataFrame** is a two-dimensional labeled data structure with columns of potentially different types. It’s similar to a table or a spreadsheet.
- A DataFrame can be created from various data sources, including lists, dictionaries, and files (like CSV or Excel).

##### Example:
```python
# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
```
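
A dictionary of columns is not the only option. As a minimal sketch, the same kind of table can be built from a list of records (one dictionary per row); the name `df_records` and the sample rows are illustrative.

```python
import pandas as pd

# Each dictionary is one row; keys become column names
records = [
    {'Name': 'Alice', 'Age': 25},
    {'Name': 'Bob', 'Age': 30},
]
df_records = pd.DataFrame(records)

print(df_records)
print(df_records.dtypes)  # each column keeps its own data type
```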

### 4. **Reading Data from Files**
Pandas makes it easy to read data from various file formats like CSV, Excel, etc.

- **CSV files**:
```python
df = pd.read_csv('filename.csv')
```

- **Excel files**:
```python
df = pd.read_excel('filename.xlsx')
```

- **JSON files**:
```python
df = pd.read_json('filename.json')
```
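
The readers accept options that control what is loaded. The sketch below reads CSV text from an in-memory buffer so it runs without a file on disk; the column names and values are made up for illustration, while `usecols` and `parse_dates` are standard `read_csv` parameters.

```python
import io
import pandas as pd

# Made-up CSV content standing in for a real file
csv_text = "name,age,joined\nAlice,25,2021-01-15\nBob,30,2020-06-01\n"

df_csv = pd.read_csv(
    io.StringIO(csv_text),      # any file path or file-like object works here
    usecols=['name', 'joined'], # load only these columns
    parse_dates=['joined'],     # convert this column to datetime
)
print(df_csv.dtypes)
```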

### 5. **Inspecting Data**
Once you have a DataFrame, you’ll often want to inspect the data to understand its structure.

- **Viewing the first few rows**:
```python
df.head()  # Default: shows first 5 rows
```

- **Viewing the last few rows**:
```python
df.tail()  # Default: shows last 5 rows
```

- **Getting basic information**:
```python
df.info()  # Prints a summary: column names, dtypes, non-null counts, and memory usage
```

- **Descriptive statistics**:
```python
df.describe()  # Gives summary statistics (mean, std, min, etc.) for numeric columns
```

### 6. **Indexing and Selecting Data**

#### **Selecting Columns**
You can select a single column or multiple columns from a DataFrame.

- Single column (returns a Series):
```python
df['Name']
```

- Multiple columns (returns a DataFrame):
```python
df[['Name', 'City']]
```

#### **Selecting Rows**
There are multiple ways to select rows in a DataFrame:

- Using the index label:
```python
df.loc[0]  # Selects the row whose index label is 0 (the first row with the default index)
```

- Using integer-based indexing:
```python
df.iloc[0]  # Selects the first row by integer position
```

- Slicing rows:
```python
df.iloc[0:3]  # Selects the first three rows
```
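
The difference between `loc` and `iloc` only becomes visible when the index labels are not simply 0, 1, 2. Here is a small sketch with a hypothetical frame, `df_labeled`, whose labels are 10, 20, and 30.

```python
import pandas as pd

# Hypothetical frame with non-default index labels
df_labeled = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie']},
    index=[10, 20, 30],
)

first_by_position = df_labeled.iloc[0]  # first row, whatever its label is
row_by_label = df_labeled.loc[20]       # row whose label is 20

# Slices differ too: iloc excludes the end position, loc includes the end label
by_position = df_labeled.iloc[0:2]      # positions 0 and 1
by_label = df_labeled.loc[10:30]        # labels 10, 20 and 30

print(by_position)
print(by_label)
```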

### 7. **Filtering Data**
You can filter the rows of a DataFrame based on a condition.

##### Example:
```python
# Filter rows where Age is greater than 30
df_filtered = df[df['Age'] > 30]
print(df_filtered)
```

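You can also combine multiple conditions with `&` (and) and `|` (or). Wrap each condition in parentheses, because these operators bind more tightly than comparisons such as `>` and `==`.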

##### Example:
```python
# Filter rows where Age > 30 and City is 'Chicago'
df_filtered = df[(df['Age'] > 30) & (df['City'] == 'Chicago')]
print(df_filtered)
```
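
The other common building blocks are `|` (or), `~` (not), and `isin()` for membership tests. The sketch below rebuilds the sample data under the illustrative name `people` so it can run on its own.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# OR: younger than 30, or living in Chicago
young_or_chicago = people[(people['Age'] < 30) | (people['City'] == 'Chicago')]

# NOT: everyone who is not in Chicago
not_chicago = people[~(people['City'] == 'Chicago')]

# Membership: City is one of several values
coastal = people[people['City'].isin(['New York', 'Los Angeles'])]

print(young_or_chicago)
```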

### 8. **Modifying Data**

#### **Adding New Columns**
You can add new columns to a DataFrame by directly assigning values.

##### Example:
```python
# Adding a new column 'Salary'
df['Salary'] = [50000, 60000, 70000]
print(df)
```

#### **Updating Values**
You can update values in a DataFrame based on conditions or directly by index.

##### Example:
```python
# Update the 'City' of the first row
df.loc[0, 'City'] = 'San Francisco'
print(df)
```
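
Updating by condition works the same way: pass a boolean mask as the row selector to `loc`. This is a minimal sketch on a rebuilt copy of the sample data; the name `people` and the value `'Remote'` are illustrative.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# The mask picks the rows, the column label picks which cells to overwrite
people.loc[people['Age'] > 28, 'City'] = 'Remote'
print(people)
```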

#### **Dropping Columns or Rows**
You can remove columns or rows using `drop()`.

##### Example:
- Dropping a column:
```python
df = df.drop(columns=['Salary'])
print(df)
```

- Dropping a row:
```python
df = df.drop(0)  # Drops the row whose index label is 0 (the first row here)
print(df)
```

### 9. **Handling Missing Data**
Pandas provides tools to handle missing data (NaN values).

- **Checking for missing values**:
```python
df.isnull().sum()  # Shows the number of missing values per column
```

- **Filling missing values**:
```python
df.fillna(value=0)  # Returns a new DataFrame with NaN values replaced by 0 (df itself is unchanged)
```

- **Dropping missing values**:
```python
df.dropna()  # Returns a new DataFrame without the rows that contain any NaN value
```
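
To see these methods on actual missing values, here is a small sketch with a made-up `scores` frame. It fills each gap with the column mean instead of 0, which is one common choice rather than the only one.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value per column
scores = pd.DataFrame({
    'math': [90.0, np.nan, 70.0],
    'art': [np.nan, 85.0, 60.0],
})

missing_per_column = scores.isnull().sum()  # count of NaN per column
filled = scores.fillna(scores.mean())       # fill each column with its own mean
complete_rows = scores.dropna()             # keep only rows without any NaN

print(missing_per_column)
print(filled)
print(complete_rows)
```

None of these calls modify `scores`; assign the result back if you want to keep it.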

### 10. **GroupBy and Aggregation**
The `groupby()` method splits the rows into groups that share a value in one or more columns, then applies an aggregation function (like sum, mean, count) to each group.

##### Example:
```python
# Group by 'City' and calculate the mean age
df_grouped = df.groupby('City')['Age'].mean()
print(df_grouped)
```
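
Several aggregations can be computed in one pass with `agg()`. The sketch below uses named aggregation on a made-up `sales` frame; the output column names (`total_sales`, `mean_sales`, `orders`) are chosen for illustration.

```python
import pandas as pd

sales = pd.DataFrame({
    'City': ['New York', 'Chicago', 'New York', 'Chicago'],
    'Sales': [200, 150, 300, 250],
})

# new_column_name=(source_column, aggregation)
summary = sales.groupby('City').agg(
    total_sales=('Sales', 'sum'),
    mean_sales=('Sales', 'mean'),
    orders=('Sales', 'count'),
)
print(summary)
```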

### 11. **Sorting Data**
You can sort the data in a DataFrame by columns using `sort_values()`.

##### Example:
```python
# Sort by Age in descending order
df_sorted = df.sort_values(by='Age', ascending=False)
print(df_sorted)
```

### 12. **Merging and Joining Data**
You can merge or join two DataFrames, similar to SQL joins.

##### Example:
```python
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'Salary': [50000, 60000, 70000]})

# Merge on 'ID'
df_merged = pd.merge(df1, df2, on='ID')
print(df_merged)
```
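
`pd.merge()` performs an inner join by default, so rows without a match on either side are dropped. The sketch below changes the sample data slightly (ID 4 instead of 3 in `df2`, a made-up value) to show how `how='left'` keeps unmatched rows and how `indicator=True` records where each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Salary': [50000, 60000, 80000]})

# Inner join (default): only IDs present in both frames
inner = pd.merge(df1, df2, on='ID')

# Left join: every row of df1 is kept; missing salaries become NaN
left = pd.merge(df1, df2, on='ID', how='left', indicator=True)

print(inner)
print(left)
```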

### 13. **Pivot Tables**
You can create pivot tables to summarize data.

##### Example:
```python
df = pd.DataFrame({
    'City': ['New York', 'Chicago', 'New York', 'Chicago'],
    'Sales': [200, 150, 300, 250]
})

# Create a pivot table showing the sum of Sales by City
pivot_table = df.pivot_table(values='Sales', index='City', aggfunc='sum')
print(pivot_table)
```

### 14. **Exporting Data**
You can export a DataFrame to various formats, such as CSV, Excel, or JSON.

- **To CSV**:
```python
df.to_csv('output.csv', index=False)
```

- **To Excel**:
```python
df.to_excel('output.xlsx', index=False)
```
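
JSON export follows the same pattern with `to_json()`. As a sketch, the snippet below writes to strings instead of files (both methods return text when no path is given) and reads the CSV text back to confirm that nothing was lost; the frame `df_out` is made up for illustration.

```python
import io
import json
import pandas as pd

df_out = pd.DataFrame({'City': ['New York', 'Chicago'], 'Sales': [500, 400]})

csv_text = df_out.to_csv(index=False)         # no path: returns the CSV as a string
json_text = df_out.to_json(orient='records')  # one JSON object per row

records = json.loads(json_text)               # parse to inspect the structure
restored = pd.read_csv(io.StringIO(csv_text)) # round trip through CSV

print(records)
print(restored.equals(df_out))
```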

### 15. **Advanced Operations**

#### **Apply Functions**
You can apply custom functions to columns or rows using the `apply()` function.

##### Example:
```python
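# Apply a custom function to an 'Age' column
# (df was redefined in the pivot-table example, so a small frame is rebuilt here)
people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})
people['Age_in_5_years'] = people['Age'].apply(lambda x: x + 5)
print(people)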
```

#### **Window Functions**
Pandas supports rolling window operations, which are useful for time series analysis.

##### Example:
```python
# Rolling mean over 2 periods
df['Rolling_mean'] = df['Sales'].rolling(window=2).mean()
print(df)
```
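
A rolling window needs a full set of values before it produces a result, so the first entries are NaN. The sketch below, on a made-up `sales` frame, shows that behaviour next to `min_periods=1`, which lets shorter windows at the start produce a value.

```python
import pandas as pd

sales = pd.DataFrame({'Sales': [200, 150, 300, 250]})

# Default: a window of 2 has no result for the very first row
strict = sales['Sales'].rolling(window=2).mean()

# min_periods=1: the first row is averaged over the single value available
lenient = sales['Sales'].rolling(window=2, min_periods=1).mean()

print(strict)
print(lenient)
```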

---


# Practice and Special Notes

In [2]:
import pandas as pd
import numpy as np

In [3]:
company_info = {
    'Name' : ["Wasif","Galib","Hasib"],
    'Age'  : [23,21,26],
    'Salary' : [120000,45000,34000]
}
df = pd.DataFrame(company_info)
print(df)

    Name  Age  Salary
0  Wasif   23  120000
1  Galib   21   45000
2  Hasib   26   34000


In [4]:
nums = np.random.randint(1,100,size=[30,5])
ndf = pd.DataFrame(nums, columns=["A",'B','C','D','E'], index=[x for x in range(1,31)])
print(ndf.head())

    A   B   C   D   E
1  50  94  67  72  85
2  33  37  10  86  84
3  89  35  13  52  27
4  79  91  11  88  51
5   1  31  69  55  84


In [5]:
# Get column dtypes, non-null counts, number of rows, and memory usage
ndf.info()

<class 'pandas.core.frame.DataFrame'>
Index: 30 entries, 1 to 30
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       30 non-null     int32
 1   B       30 non-null     int32
 2   C       30 non-null     int32
 3   D       30 non-null     int32
 4   E       30 non-null     int32
dtypes: int32(5)
memory usage: 840.0 bytes


In [6]:
# Summary statistics per column: count, mean, std, min, quartiles, max
ndf.describe()

Unnamed: 0,A,B,C,D,E
count,30.0,30.0,30.0,30.0,30.0
mean,52.433333,52.0,47.266667,50.666667,56.1
std,29.602753,30.017236,28.686093,28.673349,25.208578
min,1.0,1.0,4.0,3.0,3.0
25%,26.25,27.25,17.25,24.5,40.5
50%,57.5,48.0,49.5,49.5,55.5
75%,77.5,79.75,72.75,79.0,75.25
max,98.0,94.0,99.0,99.0,99.0


In [7]:
ndf.shape

(30, 5)

In [8]:
# Unique values in a column, in order of first appearance
ndf['A'].unique()

array([50, 33, 89, 79,  1, 70, 76, 54, 34, 15,  7, 62, 23, 78, 74, 56, 91,
       30, 72, 63, 59, 98, 25, 86, 21, 92])

In [9]:
# Parquet is a compressed, columnar format; files are usually smaller and faster to load than CSV
# (read_parquet needs the pyarrow or fastparquet package installed)
res = pd.read_parquet("../dataset/results.parquet")
res

Unnamed: 0,year,type,discipline,event,as,athlete_id,noc,team,place,tied,medal
0,1912.0,Summer,Tennis,"Singles, Men (Olympic)",Jean-François Blanchy,1,FRA,,17.0,True,
1,1912.0,Summer,Tennis,"Doubles, Men (Olympic)",Jean-François Blanchy,1,FRA,Jean Montariol,,False,
2,1920.0,Summer,Tennis,"Singles, Men (Olympic)",Jean-François Blanchy,1,FRA,,32.0,True,
3,1920.0,Summer,Tennis,"Doubles, Mixed (Olympic)",Jean-François Blanchy,1,FRA,Jeanne Vaussard,8.0,True,
4,1920.0,Summer,Tennis,"Doubles, Men (Olympic)",Jean-François Blanchy,1,FRA,Jacques Brugnon,4.0,False,
...,...,...,...,...,...,...,...,...,...,...,...
308403,2022.0,Winter,Luge,"Singles, Men (Olympic)",Marián Skupek,148983,SVK,,26.0,False,
308404,2022.0,Winter,Alpine Skiing (Skiing),"Slalom, Women (Olympic)",Elsa Fermbäck,148984,SWE,,28.0,False,
308405,2022.0,Winter,Alpine Skiing (Skiing),"Team, Mixed (Olympic)",Hilma Lövblom,148985,SWE,Sweden,13.0,False,
308406,2022.0,Winter,Alpine Skiing (Skiing),"Giant Slalom, Women (Olympic)",Hilma Lövblom,148985,SWE,,,False,


In [10]:
res.head(3)  # head(n) -> first n rows; tail(n) -> last n rows

Unnamed: 0,year,type,discipline,event,as,athlete_id,noc,team,place,tied,medal
0,1912.0,Summer,Tennis,"Singles, Men (Olympic)",Jean-François Blanchy,1,FRA,,17.0,True,
1,1912.0,Summer,Tennis,"Doubles, Men (Olympic)",Jean-François Blanchy,1,FRA,Jean Montariol,,False,
2,1920.0,Summer,Tennis,"Singles, Men (Olympic)",Jean-François Blanchy,1,FRA,,32.0,True,


In [11]:
# Return 5 randomly chosen rows
res.sample(5)

Unnamed: 0,year,type,discipline,event,as,athlete_id,noc,team,place,tied,medal
209416,1972.0,Winter,Cross Country Skiing (Skiing),"15 kilometres, Men (Olympic)",Jan Staszel,97775,POL,,33.0,False,
55526,1936.0,Summer,Artistic Gymnastics (Gymnastics),"Team All-Around, Women (Olympic)",Alina Cichecka,28919,POL,Poland,6.0,False,
189847,1948.0,Winter,Alpine Skiing (Skiing),"Downhill, Men (Olympic)",Roberto Lacedelli,88796,ITA,,,False,
267756,2012.0,Summer,Football (Football),"Football, Men (Olympic)",Ahmed Hegazi,125160,EGY,Egypt,8.0,False,
146856,1968.0,Summer,Athletics,"800 metres, Men (Olympic)",Franz-Josef Kemper,70331,FRG,,7.0,False,


In [12]:
res

Unnamed: 0,year,type,discipline,event,as,athlete_id,noc,team,place,tied,medal
0,1912.0,Summer,Tennis,"Singles, Men (Olympic)",Jean-François Blanchy,1,FRA,,17.0,True,
1,1912.0,Summer,Tennis,"Doubles, Men (Olympic)",Jean-François Blanchy,1,FRA,Jean Montariol,,False,
2,1920.0,Summer,Tennis,"Singles, Men (Olympic)",Jean-François Blanchy,1,FRA,,32.0,True,
3,1920.0,Summer,Tennis,"Doubles, Mixed (Olympic)",Jean-François Blanchy,1,FRA,Jeanne Vaussard,8.0,True,
4,1920.0,Summer,Tennis,"Doubles, Men (Olympic)",Jean-François Blanchy,1,FRA,Jacques Brugnon,4.0,False,
...,...,...,...,...,...,...,...,...,...,...,...
308403,2022.0,Winter,Luge,"Singles, Men (Olympic)",Marián Skupek,148983,SVK,,26.0,False,
308404,2022.0,Winter,Alpine Skiing (Skiing),"Slalom, Women (Olympic)",Elsa Fermbäck,148984,SWE,,28.0,False,
308405,2022.0,Winter,Alpine Skiing (Skiing),"Team, Mixed (Olympic)",Hilma Lövblom,148985,SWE,Sweden,13.0,False,
308406,2022.0,Winter,Alpine Skiing (Skiing),"Giant Slalom, Women (Olympic)",Hilma Lövblom,148985,SWE,,,False,


In [13]:
# loc: label-based selection -> df.loc[[row labels], [column labels]]
res.loc[[1,2,3,4],['year','type']]

Unnamed: 0,year,type
1,1912.0,Summer
2,1920.0,Summer
3,1920.0,Summer
4,1920.0,Summer


In [14]:
# iloc: position-based selection -> df.iloc[[row positions], [column positions]]
res.iloc[0:5,[0,1,6]]

Unnamed: 0,year,type,noc
0,1912.0,Summer,FRA
1,1912.0,Summer,FRA
2,1920.0,Summer,FRA
3,1920.0,Summer,FRA
4,1920.0,Summer,FRA


In [15]:
# After replacing the default index with custom labels, use loc to select rows by those labels
# (iloc still selects by position)
res.index = res['type']
res.head()

type,year,type,discipline,event,as,athlete_id,noc,team,place,tied,medal
Summer,1912.0,Summer,Tennis,"Singles, Men (Olympic)",Jean-François Blanchy,1,FRA,,17.0,True,
Summer,1912.0,Summer,Tennis,"Doubles, Men (Olympic)",Jean-François Blanchy,1,FRA,Jean Montariol,,False,
Summer,1920.0,Summer,Tennis,"Singles, Men (Olympic)",Jean-François Blanchy,1,FRA,,32.0,True,
Summer,1920.0,Summer,Tennis,"Doubles, Mixed (Olympic)",Jean-François Blanchy,1,FRA,Jeanne Vaussard,8.0,True,
Summer,1920.0,Summer,Tennis,"Doubles, Men (Olympic)",Jean-François Blanchy,1,FRA,Jacques Brugnon,4.0,False,


In [16]:
res.loc["Summer",['year','type','event']]

type,year,type,event
Summer,1912.0,Summer,"Singles, Men (Olympic)"
Summer,1912.0,Summer,"Doubles, Men (Olympic)"
Summer,1920.0,Summer,"Singles, Men (Olympic)"
Summer,1920.0,Summer,"Doubles, Mixed (Olympic)"
Summer,1920.0,Summer,"Doubles, Men (Olympic)"
...,...,...,...
Summer,1996.0,Summer,"Water Polo, Men (Olympic)"
Summer,1920.0,Summer,"Coxed Fours, Men (Olympic)"
Summer,2018.0,Summer,"Combined, Boys (YOG)"
Summer,2018.0,Summer,"Sprint, Girls (YOG)"


In [17]:
# Assign with loc: set 'event' for every row whose index label is "Summer"
res.loc["Summer",['event']] = "Changed Value"
res.head(10)

type,year,type,discipline,event,as,athlete_id,noc,team,place,tied,medal
Summer,1912.0,Summer,Tennis,Changed Value,Jean-François Blanchy,1,FRA,,17.0,True,
Summer,1912.0,Summer,Tennis,Changed Value,Jean-François Blanchy,1,FRA,Jean Montariol,,False,
Summer,1920.0,Summer,Tennis,Changed Value,Jean-François Blanchy,1,FRA,,32.0,True,
Summer,1920.0,Summer,Tennis,Changed Value,Jean-François Blanchy,1,FRA,Jeanne Vaussard,8.0,True,
Summer,1920.0,Summer,Tennis,Changed Value,Jean-François Blanchy,1,FRA,Jacques Brugnon,4.0,False,
Summer,1996.0,Summer,Tennis,Changed Value,Arnaud Boetsch,2,FRA,,17.0,True,
Summer,1996.0,Summer,Tennis,Changed Value,Arnaud Boetsch,2,FRA,Guillaume Raoux,17.0,True,
Summer,1924.0,Summer,Tennis,Changed Value,Jean Borotra,3,FRA,,4.0,False,
Summer,1924.0,Summer,Tennis,Changed Value,Jean Borotra,3,FRA,Marguerite Billout,15.0,True,
Summer,1924.0,Summer,Tennis,Changed Value,Jean Borotra,3,FRA,René Lacoste,3.0,False,Bronze


### Sort Values

In [18]:
res.sort_values(['year'])

type,year,type,discipline,event,as,athlete_id,noc,team,place,tied,medal
Summer,1896.0,Summer,Athletics,Changed Value,Nándor Dáni,71127,HUN,,,False,
Summer,1896.0,Summer,Athletics,Changed Value,Georgios Papasideris,70824,GRE,,3.0,False,Bronze
Summer,1896.0,Summer,Athletics,Changed Value,Tom Curtis,78290,USA,,,False,
Summer,1896.0,Summer,Athletics,Changed Value,Tom Curtis,78290,USA,,1.0,False,Gold
Summer,1896.0,Summer,Athletics,Changed Value,Tom Curtis,78290,USA,,1.0,False,
...,...,...,...,...,...,...,...,...,...,...,...
,,,Fencing,"Sabre, Individual, Men (Olympic)",Lóránt Mészáros,95189,HUN,,5.0,False,
,,,Fencing,"Sabre, Team, Men (Olympic)",Lóránt Mészáros,95189,HUN,Hungary,4.0,False,
,,,Football (Football),"Football, Men (Intercalated)",Georgios Pantos,100811,GRE,Athens,,False,
,,,Football (Football),"Football, Men (Intercalated)",Alexandros Kalafatis,100812,GRE,Athens,,False,


In [58]:
res.sort_values(['year'],ascending=False)


type,year,type,discipline,event,as,athlete_id,noc,team,place,tied,medal
Winter,2022.0,Winter,,"Slalom, Women (Olympic)",Charlotta Säfvenberg,148986,,,24.0,False,
Winter,2022.0,Winter,Ice Hockey (Ice Hockey),"Ice Hockey, Men (Olympic)",Ronalds Ķēniņš,128130,LAT,Latvia,11.0,False,
Winter,2022.0,Winter,Ice Hockey (Ice Hockey),"Ice Hockey, Men (Olympic)",Kristers Gudļevskis,128127,LAT,Latvia,,False,
Winter,2022.0,Winter,Ice Hockey (Ice Hockey),"Ice Hockey, Men (Olympic)",Ralfs Freibergs,128125,LAT,Latvia,11.0,False,
Winter,2022.0,Winter,Bobsleigh (Bobsleigh),"Four, Open (Olympic)",Oskars Ķibermanis,128118,LAT,Latvia 1,5.0,False,
...,...,...,...,...,...,...,...,...,...,...,...
,,,Fencing,"Sabre, Individual, Men (Olympic)",Lóránt Mészáros,95189,HUN,,5.0,False,
,,,Fencing,"Sabre, Team, Men (Olympic)",Lóránt Mészáros,95189,HUN,Hungary,4.0,False,
,,,Football (Football),"Football, Men (Intercalated)",Georgios Pantos,100811,GRE,Athens,,False,
,,,Football (Football),"Football, Men (Intercalated)",Alexandros Kalafatis,100812,GRE,Athens,,False,


In [68]:
# Filter first, then sort by several columns (both in descending order here)
res[res["type"] == "Winter"].sort_values(by=["year","discipline"],ascending=False)

type,year,type,discipline,event,as,athlete_id,noc,team,place,tied,medal
Winter,2022.0,Winter,Speed Skating (Skating),"3,000 metres, Women (Olympic)",Claudia Pechstein,82053,GER,,20.0,False,
Winter,2022.0,Winter,Speed Skating (Skating),"Mass Start, Women (Olympic)",Claudia Pechstein,82053,GER,,9.0,False,
Winter,2022.0,Winter,Speed Skating (Skating),"5,000 metres, Men (Olympic)",Sven Kramer,109681,NED,,9.0,False,
Winter,2022.0,Winter,Speed Skating (Skating),"Mass Start, Men (Olympic)",Sven Kramer,109681,NED,,16.0,False,
Winter,2022.0,Winter,Speed Skating (Skating),"Team Pursuit (8 laps), Men (Olympic)",Sven Kramer,109681,NED,Netherlands,4.0,False,
...,...,...,...,...,...,...,...,...,...,...,...
Winter,1924.0,Winter,Bobsleigh (Bobsleigh),"Four/Five, Men (Olympic)",Victor Verschueren,98621,BEL,Belgium 1,3.0,False,Bronze
Winter,1924.0,Winter,Bobsleigh (Bobsleigh),Ice Hockey (Ice Hockey),Victor Verschueren,98621,BEL,BEL,,True,
Winter,1924.0,Winter,Bobsleigh (Bobsleigh),"Ice Hockey, Men (Olympic)",Victor Verschueren,98621,BEL,Belgium,7.0,True,
Winter,1924.0,Winter,Bobsleigh (Bobsleigh),"Four/Five, Men (Olympic)",Alberto Visconti,98666,ITA,Italy 2,,False,


# Filtering Data

In [20]:
bios = pd.read_csv("../dataset/bios.csv")
bios.sample(10)

Unnamed: 0,athlete_id,name,born_date,born_city,born_region,born_country,NOC,height_cm,weight_kg,died_date
104960,105989,Alejandra Benítez,1980-07-07,Caracas,Distrito Capital,VEN,Venezuela,169.0,62.0,
71829,72371,Fumiko Ito,1940-03-05,,,,Japan,159.0,49.0,
49552,49909,Tony Portela,1966-04-21,,,,Puerto Rico,168.0,73.0,
91372,92101,Viktor Mamatov,1937-07-21,Belovo,Kemerovo,RUS,Soviet Union,182.0,78.0,2023-10-27
49575,49932,Melania Decuseară,1945-11-22,București (Bucharest),București,ROU,Romania,163.0,52.0,
103723,104715,Irina Laricheva,1964-11-19,Moskva (Moscow),Moskva,RUS,Russian Federation,168.0,72.0,2020-01-29
30542,30777,Trygve Bøyesen,1886-02-15,Skien,Vestfold og Telemark,NOR,Norway,,,1963-07-27
131668,134486,Milad Ebadipour,1993-10-17,Orumiyeh (Urmia),Azarbaijan Gharbi,IRI,Islamic Republic of Iran,202.0,78.0,
89076,89788,Jorge Garbajosa,1977-12-19,Torrejón de Ardoz,Madrid,ESP,Spain,204.0,107.0,
30816,31052,Josef Walter,1901-12-01,,,,Switzerland,,,1973-01-01


In [38]:
# Syntax: df[<boolean row condition>][[list of column names]]

aus = bios[(bios["NOC"] == "Australia") & (bios["weight_kg"] > 90 )][["athlete_id","name","NOC"]]
aus.head()


Unnamed: 0,athlete_id,name,NOC
300,301,Jackson Fear,Australia
631,634,Mark Philippoussis,Australia
1464,1471,David Hynes,Australia
1465,1472,Sten Lindberg,Australia
1473,1480,Stuart Thompson,Australia


In [49]:
bios[bios["name"].str.contains("tony", case=False)]

Unnamed: 0,athlete_id,name,born_date,born_city,born_region,born_country,NOC,height_cm,weight_kg,died_date
1017,1021,Tony Mmoh,1958-06-14,Enugu,Enugu,NGR,Nigeria,177.0,81.0,
1199,1206,Tony Mancini,1913-01-17,Montréal,Québec,CAN,Canada,,,1990-08-19
3519,3531,Tony Willis,1960-06-17,Liverpool,England,GBR,Great Britain,170.0,64.0,
3521,3533,Tony Wilson,1961-04-15,Wolverhampton,England,GBR,Great Britain,180.0,81.0,
3687,3699,Tony Martey,1944-06-23,Aflao,Volta,GHA,Ghana,160.0,60.0,
...,...,...,...,...,...,...,...,...,...,...
132541,135417,Tony Dodds,1987-06-16,Balclutha,Otago,NZL,New Zealand,183.0,68.0,
137845,141224,Antony,2000-02-24,Osasco,São Paulo,BRA,Brazil,174.0,63.0,
140246,143749,Alex Antony,1994-09-03,,,,India,,,
141963,145550,Tony van Diepen,1996-04-17,,,,Netherlands,,,


In [50]:
country = ['GBR','CAN','USA']
bios[bios["born_country"].isin(country)][["name","born_city"]]

Unnamed: 0,name,born_city
4,Albert Canet,Wandsworth
37,Helen Aitchison,Sunderland
38,Geraldine Beamish,Forest Gate
39,Dora Boothby,Finchley
40,Julie Bradbury,Oxford
...,...,...
145457,Alix Wilkinson,Mammoth Lakes
145461,Kent Johnson,Port Moody
145462,Morgan Ellis,Summerside
145468,Justin Abdelkader,Muskegon


In [52]:
# Another way to filter: DataFrame.filter() matches index or column labels, not cell values
bios2 = bios.set_index("NOC")
bios2.head()

NOC,athlete_id,name,born_date,born_city,born_region,born_country,height_cm,weight_kg,died_date
France,1,Jean-François Blanchy,1886-12-12,Bordeaux,Gironde,FRA,,,1960-10-02
France,2,Arnaud Boetsch,1969-04-01,Meulan,Yvelines,FRA,183.0,76.0,
France,3,Jean Borotra,1898-08-13,Biarritz,Pyrénées-Atlantiques,FRA,183.0,76.0,1994-07-17
France,4,Jacques Brugnon,1895-05-11,Paris VIIIe,Paris,FRA,168.0,64.0,1978-03-20
France,5,Albert Canet,1878-04-17,Wandsworth,England,GBR,,,1930-07-25


In [56]:
bios2.filter(like="United", axis=0)  # axis=0 matches row labels; by default filter() matches column labels

NOC,athlete_id,name,born_date,born_city,born_region,born_country,height_cm,weight_kg,died_date
Georgia Unified Team United States,504,Khatuna Kvrivishvili-Lorig,1974-01-01,Tbilisi,Tbilisi,GEO,170.0,64.0,
People's Republic of China United States,783,Jun Gao,1969-01-25,Baoding,Hebei,CHN,168.0,73.0,
United States,1352,Laura Berg,1975-01-06,Santa Fe Springs,California,USA,168.0,61.0,
United States,1353,Gillian Boxx,1973-09-01,Fontana,California,USA,170.0,,
United States,1363,Sheila Cornell-Douty,1962-02-26,Encino,California,USA,175.0,81.0,
...,...,...,...,...,...,...,...,...,...
United States,149169,Corinne Stoddard,2001-08-15,Seattle,Washington,USA,,,
United States,149170,Andrew Heo,2001-03-07,,,,,,
United States,149180,Anna Hoffmann,2000-03-28,Madison,Wisconsin,USA,,,
United States,149183,Alix Wilkinson,2000-08-02,Mammoth Lakes,California,USA,,,


In [57]:
bios2.filter(items=['name','born_date'])

NOC,name,born_date
France,Jean-François Blanchy,1886-12-12
France,Arnaud Boetsch,1969-04-01
France,Jean Borotra,1898-08-13
France,Jacques Brugnon,1895-05-11
France,Albert Canet,1878-04-17
...,...,...
ROC,Polina Luchnikova,2002-01-30
ROC,Valeriya Merkusheva,1999-09-20
ROC,Yuliya Smirnova,1998-05-08
France,André Foussard,1899-05-19
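
`filter()` has three ways to match labels: `items` (exact names), `like` (substring), and `regex` (regular expression). Here is a minimal sketch on a one-row frame named `athletes`, built by hand from a row shown earlier, selecting columns each way.

```python
import pandas as pd

athletes = pd.DataFrame({
    'name': ['Jean Borotra'],
    'born_date': ['1898-08-13'],
    'born_city': ['Biarritz'],
    'height_cm': [183.0],
})

by_items = athletes.filter(items=['name', 'height_cm'])  # exact column names
by_like = athletes.filter(like='_cm')                    # names containing '_cm'
by_regex = athletes.filter(regex='^born_')               # names starting with 'born_'

print(by_regex)
```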


# Indexing

In [75]:
bios.set_index("name",inplace=True)
bios

name,athlete_id,born_date,born_region,born_country,NOC,height_cm,weight_kg,died_date
Jean-François Blanchy,1,1886-12-12,Gironde,FRA,France,,,1960-10-02
Arnaud Boetsch,2,1969-04-01,Yvelines,FRA,France,183.0,76.0,
Jean Borotra,3,1898-08-13,Pyrénées-Atlantiques,FRA,France,183.0,76.0,1994-07-17
Jacques Brugnon,4,1895-05-11,Paris,FRA,France,168.0,64.0,1978-03-20
Albert Canet,5,1878-04-17,England,GBR,France,,,1930-07-25
...,...,...,...,...,...,...,...,...
Polina Luchnikova,149222,2002-01-30,Sverdlovsk,RUS,ROC,167.0,61.0,
Valeriya Merkusheva,149223,1999-09-20,Moskva,RUS,ROC,168.0,65.0,
Yuliya Smirnova,149224,1998-05-08,Arkhangelsk,RUS,ROC,163.0,55.0,
André Foussard,149225,1899-05-19,Deux-Sèvres,FRA,France,166.0,,1986-03-18


In [89]:
bios.reset_index(inplace=True)
bios

Unnamed: 0,born_country,born_region,index,name,athlete_id,born_date,NOC,height_cm,weight_kg,died_date
0,FRA,Gironde,0,Jean-François Blanchy,1,1886-12-12,France,,,1960-10-02
1,FRA,Yvelines,1,Arnaud Boetsch,2,1969-04-01,France,183.0,76.0,
2,FRA,Pyrénées-Atlantiques,2,Jean Borotra,3,1898-08-13,France,183.0,76.0,1994-07-17
3,FRA,Paris,3,Jacques Brugnon,4,1895-05-11,France,168.0,64.0,1978-03-20
4,GBR,England,4,Albert Canet,5,1878-04-17,France,,,1930-07-25
...,...,...,...,...,...,...,...,...,...,...
145495,RUS,Sverdlovsk,145495,Polina Luchnikova,149222,2002-01-30,ROC,167.0,61.0,
145496,RUS,Moskva,145496,Valeriya Merkusheva,149223,1999-09-20,ROC,168.0,65.0,
145497,RUS,Arkhangelsk,145497,Yuliya Smirnova,149224,1998-05-08,ROC,163.0,55.0,
145498,FRA,Deux-Sèvres,145498,André Foussard,149225,1899-05-19,France,166.0,,1986-03-18


In [90]:
# Composite (multi-level) index built from two columns

bios.set_index(["born_country","born_region"],inplace=True)
bios

born_country,born_region,index,name,athlete_id,born_date,NOC,height_cm,weight_kg,died_date
FRA,Gironde,0,Jean-François Blanchy,1,1886-12-12,France,,,1960-10-02
FRA,Yvelines,1,Arnaud Boetsch,2,1969-04-01,France,183.0,76.0,
FRA,Pyrénées-Atlantiques,2,Jean Borotra,3,1898-08-13,France,183.0,76.0,1994-07-17
FRA,Paris,3,Jacques Brugnon,4,1895-05-11,France,168.0,64.0,1978-03-20
GBR,England,4,Albert Canet,5,1878-04-17,France,,,1930-07-25
...,...,...,...,...,...,...,...,...,...
RUS,Sverdlovsk,145495,Polina Luchnikova,149222,2002-01-30,ROC,167.0,61.0,
RUS,Moskva,145496,Valeriya Merkusheva,149223,1999-09-20,ROC,168.0,65.0,
RUS,Arkhangelsk,145497,Yuliya Smirnova,149224,1998-05-08,ROC,163.0,55.0,
FRA,Deux-Sèvres,145498,André Foussard,149225,1899-05-19,France,166.0,,1986-03-18
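
With a two-level index, `loc` can select by the first level alone or by a full `(country, region)` tuple. The sketch below uses a three-row frame, `bios_small`, built by hand from rows shown above, and sorts the index first, which is generally recommended before label lookups on a multi-level index.

```python
import pandas as pd

bios_small = pd.DataFrame({
    'born_country': ['FRA', 'FRA', 'GBR'],
    'born_region': ['Gironde', 'Paris', 'England'],
    'name': ['Jean-François Blanchy', 'Jacques Brugnon', 'Albert Canet'],
}).set_index(['born_country', 'born_region']).sort_index()

# First level only: all French rows, indexed by the remaining level
france = bios_small.loc['FRA']

# Full key as a tuple, plus a column label: a single cell
paris_name = bios_small.loc[('FRA', 'Paris'), 'name']

print(france)
print(paris_name)
```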


In [85]:
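# sort_index() returns a new, sorted DataFrame; the result is not assigned here,
# so bios itself is unchanged and head() still shows the original order
bios.sort_index(ascending=[False,False])
bios.head()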

born_country,born_region,index,name,athlete_id,born_date,NOC,height_cm,weight_kg,died_date
FRA,Gironde,0,Jean-François Blanchy,1,1886-12-12,France,,,1960-10-02
FRA,Yvelines,1,Arnaud Boetsch,2,1969-04-01,France,183.0,76.0,
FRA,Pyrénées-Atlantiques,2,Jean Borotra,3,1898-08-13,France,183.0,76.0,1994-07-17
FRA,Paris,3,Jacques Brugnon,4,1895-05-11,France,168.0,64.0,1978-03-20
GBR,England,4,Albert Canet,5,1878-04-17,France,,,1930-07-25
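
`sort_index()` does not change the DataFrame it is called on unless you pass `inplace=True`, so the sorted result has to be kept explicitly. Here is a minimal sketch with a made-up three-row `frame`.

```python
import pandas as pd

frame = pd.DataFrame({'value': [1, 2, 3]}, index=['b', 'c', 'a'])

frame.sort_index(ascending=False)          # result is discarded
unchanged = frame.index.tolist()           # frame still has its original order

sorted_frame = frame.sort_index(ascending=False)  # keep the sorted copy
print(unchanged)
print(sorted_frame)
```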
