# Let's Explore `Pandas` - An Awesome Python Library


**Pandas** is a powerful Python library primarily used for data manipulation and analysis. It's built on top of **NumPy** and provides two main data structures: **Series** and **DataFrame**, which are designed to handle structured data intuitively.

`Official Doc` - https://pandas.pydata.org/docs/user_guide/10min.html

`W3Schools` - https://www.w3schools.com/python/pandas/pandas_intro.asp

Let's dive into the details of Pandas:

### 1. **Installing Pandas**
First, you need to install Pandas (if you haven't already):
```bash
pip install pandas
```

### 2. **Importing Pandas**
You typically import Pandas using the alias `pd`:
```python
import pandas as pd
```

### 3. **Pandas Data Structures**

#### **Series**
- A **Series** is a one-dimensional labeled array capable of holding any data type (integer, float, string, etc.). It’s similar to a column in a spreadsheet.
- Each value in a Series is associated with an index, which makes it easy to access values.

##### Example:
```python
import pandas as pd

# Creating a Series from a list
s = pd.Series([1, 3, 5, 7])
print(s)

# Creating a Series with custom index
s_custom = pd.Series([1, 3, 5, 7], index=['a', 'b', 'c', 'd'])
print(s_custom)
```
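As a small sketch of how the index makes values easy to access, the snippet below reads values back by label and by position. The variable names `by_label`, `by_position`, and `subset` are only illustrative.

```python
import pandas as pd

s_custom = pd.Series([1, 3, 5, 7], index=['a', 'b', 'c', 'd'])

# Label-based access uses the custom index
by_label = s_custom['b']
# Position-based access goes through .iloc
by_position = s_custom.iloc[0]
# A list of labels returns a smaller Series
subset = s_custom[['a', 'd']]

print(by_label, by_position)
print(subset)
```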

#### **DataFrame**
- A **DataFrame** is a two-dimensional labeled data structure with columns of potentially different types. It’s similar to a table or a spreadsheet.
- A DataFrame can be created from various data sources, including lists, dictionaries, and files (like CSV or Excel).

##### Example:
```python
# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
```
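Since a DataFrame can also be built from lists, here is a minimal sketch of two other common constructions: a list of dictionaries (one per row) and a list of lists with explicit column names. The names `rows`, `df_from_rows`, and `df_from_lists` are hypothetical.

```python
import pandas as pd

# One dictionary per row; keys become column names
rows = [
    {'Name': 'Alice', 'Age': 25},
    {'Name': 'Bob', 'Age': 30},
]
df_from_rows = pd.DataFrame(rows)

# One inner list per row; column names are passed separately
df_from_lists = pd.DataFrame([['Alice', 25], ['Bob', 30]], columns=['Name', 'Age'])

print(df_from_rows)
print(df_from_lists)
```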

### 4. **Reading Data from Files**
Pandas makes it easy to read data from various file formats like CSV, Excel, etc.

- **CSV files**:
```python
df = pd.read_csv('filename.csv')
```

- **Excel files**:
```python
df = pd.read_excel('filename.xlsx')
```

- **JSON files**:
```python
df = pd.read_json('filename.json')
```
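The file names above are placeholders, so here is a runnable sketch that reads CSV text from an in-memory buffer instead of a file. The sample data and the name `df_csv` are assumptions made for illustration. `usecols` limits which columns are loaded.

```python
import io
import pandas as pd

# Stand-in for the contents of a CSV file
csv_text = "Name,Age,City\nAlice,25,New York\nBob,30,Los Angeles\n"

# read_csv accepts a path or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text), usecols=['Name', 'Age'])
print(df_csv)
```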

### 5. **Inspecting Data**
Once you have a DataFrame, you’ll often want to inspect the data to understand its structure.

- **Viewing the first few rows**:
```python
df.head()  # Default: shows first 5 rows
```

- **Viewing the last few rows**:
```python
df.tail()  # Default: shows last 5 rows
```

- **Getting basic information**:
```python
df.info()  # Prints a summary: index range, column dtypes, non-null counts, and memory usage
```

- **Descriptive statistics**:
```python
df.describe()  # Gives summary statistics (mean, std, min, etc.) for numeric columns
```

### 6. **Indexing and Selecting Data**

#### **Selecting Columns**
You can select a single column or multiple columns from a DataFrame.

- Single column (returns a Series):
```python
df['Name']
```

- Multiple columns (returns a DataFrame):
```python
df[['Name', 'City']]
```

#### **Selecting Rows**
There are multiple ways to select rows in a DataFrame:

- Using the index:
```python
df.loc[0]  # Selects the row whose index label is 0 (the first row under the default index)
```

- Using integer-based indexing:
```python
df.iloc[0]  # Selects the first row by integer position
```

- Slicing rows:
```python
df.iloc[0:3]  # Selects the first three rows
```
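The difference between `loc` and `iloc` only shows up when the index is not the default 0, 1, 2, and so on. The sketch below uses a hypothetical DataFrame `df_labeled` with labels 10, 20, 30 to make that visible.

```python
import pandas as pd

df_labeled = pd.DataFrame({'Age': [25, 30, 35]}, index=[10, 20, 30])

# .iloc counts positions, so 0 is always the first row
first_by_position = df_labeled.iloc[0]
# .loc looks up the index label, so 20 is the row labelled 20
row_by_label = df_labeled.loc[20]

# Slices differ too: .iloc excludes the end position, .loc includes the end label
print(df_labeled.iloc[0:2])
print(df_labeled.loc[10:30])
```

Calling `df_labeled.loc[0]` here would raise a `KeyError`, because no row carries the label 0.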

### 7. **Filtering Data**
You can filter the rows of a DataFrame based on a condition.

##### Example:
```python
# Filter rows where Age is greater than 30
df_filtered = df[df['Age'] > 30]
print(df_filtered)
```

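You can also combine multiple conditions with `&` (and), `|` (or), and `~` (not). Wrap each condition in parentheses, because these operators bind more tightly than comparisons such as `>` and `==`.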

##### Example:
```python
# Filter rows where Age > 30 and City is 'Chicago'
df_filtered = df[(df['Age'] > 30) & (df['City'] == 'Chicago')]
print(df_filtered)
```
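For completeness, here is a short sketch of the other ways to combine conditions: `|` for "or", `~` for negation, and `isin()` for matching against a list of values. The DataFrame `df_people` mirrors the earlier example and is redefined so the snippet runs on its own.

```python
import pandas as pd

df_people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# OR: younger than 30, or living in Chicago
young_or_chicago = df_people[(df_people['Age'] < 30) | (df_people['City'] == 'Chicago')]

# NOT: everyone who does not live in Chicago
not_chicago = df_people[~(df_people['City'] == 'Chicago')]

# isin: City matches any value in the list
in_cities = df_people[df_people['City'].isin(['New York', 'Chicago'])]

print(young_or_chicago)
print(not_chicago)
print(in_cities)
```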

### 8. **Modifying Data**

#### **Adding New Columns**
You can add new columns to a DataFrame by directly assigning values.

##### Example:
```python
# Adding a new column 'Salary'
df['Salary'] = [50000, 60000, 70000]
print(df)
```

#### **Updating Values**
You can update values in a DataFrame based on conditions or directly by index.

##### Example:
```python
# Update the 'City' of the first row
df.loc[0, 'City'] = 'San Francisco'
print(df)
```
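The example above updates by index. A condition-based update can be sketched the same way by passing a boolean mask to `loc`. The threshold of 28 and the value `'Remote'` are arbitrary choices for illustration.

```python
import pandas as pd

df_people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# Set 'City' to 'Remote' for every row where Age is greater than 28
df_people.loc[df_people['Age'] > 28, 'City'] = 'Remote'
print(df_people)
```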

#### **Dropping Columns or Rows**
You can remove columns or rows using `drop()`.

##### Example:
- Dropping a column:
```python
df = df.drop(columns=['Salary'])
print(df)
```

- Dropping a row:
```python
df = df.drop(0)  # Drops the row whose index label is 0 (the first row here)
print(df)
```

### 9. **Handling Missing Data**
Pandas provides tools to handle missing data (NaN values).

- **Checking for missing values**:
```python
df.isnull().sum()  # Shows the number of missing values per column
```

- **Filling missing values**:
```python
df.fillna(value=0)  # Returns a new DataFrame with NaN replaced by 0; assign the result to keep it
```

- **Dropping missing values**:
```python
df.dropna()  # Returns a new DataFrame without the rows that contain any NaN value
```
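To see these three calls on data that actually contains gaps, here is a minimal sketch. The DataFrame `df_missing` and the fill values are made up for illustration. Note that `fillna()` and `dropna()` return new objects and leave the original untouched.

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame({
    'Age': [25, np.nan, 35],
    'City': ['New York', None, 'Chicago'],
})

# Count missing values per column
missing_per_column = df_missing.isnull().sum()

# Fill each column with its own replacement: the column mean for Age, a placeholder for City
filled = df_missing.fillna({'Age': df_missing['Age'].mean(), 'City': 'Unknown'})

# Drop every row that has at least one missing value
dropped = df_missing.dropna()

print(missing_per_column)
print(filled)
print(dropped)
```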

### 10. **GroupBy and Aggregation**
The `groupby()` method groups rows by the values in one or more columns so that you can apply an aggregation (such as sum, mean, or count) to each group.

##### Example:
```python
# Group by 'City' and calculate the mean age
df_grouped = df.groupby('City')['Age'].mean()
print(df_grouped)
```
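When more than one aggregation is needed per group, `agg()` accepts a list of function names. The sketch below uses a hypothetical `df_people` with repeated cities so that each group has more than one row.

```python
import pandas as pd

df_people = pd.DataFrame({
    'City': ['New York', 'Chicago', 'New York', 'Chicago'],
    'Age': [25, 35, 31, 45],
})

# One row per city, one column per aggregation
summary = df_people.groupby('City')['Age'].agg(['mean', 'min', 'max', 'count'])
print(summary)
```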

### 11. **Sorting Data**
You can sort the data in a DataFrame by columns using `sort_values()`.

##### Example:
```python
# Sort by Age in descending order
df_sorted = df.sort_values(by='Age', ascending=False)
print(df_sorted)
```
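Sorting by several columns works the same way: pass lists to `by` and `ascending`. This is a small sketch with made-up data, sorting by city A to Z and then by age from oldest to youngest within each city.

```python
import pandas as pd

df_people = pd.DataFrame({
    'City': ['Chicago', 'New York', 'Chicago'],
    'Age': [30, 25, 35],
})

# City ascending, then Age descending within each city
sorted_multi = df_people.sort_values(by=['City', 'Age'], ascending=[True, False])
print(sorted_multi)
```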

### 12. **Merging and Joining Data**
You can merge or join two DataFrames, similar to SQL joins.

##### Example:
```python
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'Salary': [50000, 60000, 70000]})

# Merge on 'ID'
df_merged = pd.merge(df1, df2, on='ID')
print(df_merged)
```
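The SQL comparison becomes clearer when the two tables do not share every key. The sketch below uses hypothetical frames `left` and `right` and the `how` parameter of `pd.merge` to contrast inner, left, and outer joins.

```python
import pandas as pd

left = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
right = pd.DataFrame({'ID': [2, 3, 4], 'Salary': [60000, 70000, 80000]})

# inner: only IDs present in both frames (this is the default)
inner = pd.merge(left, right, on='ID', how='inner')
# left: every row from the left frame; missing salaries become NaN
left_join = pd.merge(left, right, on='ID', how='left')
# outer: every ID from either frame
outer = pd.merge(left, right, on='ID', how='outer')

print(inner)
print(left_join)
print(outer)
```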

### 13. **Pivot Tables**
You can create pivot tables to summarize data.

##### Example:
```python
df = pd.DataFrame({
    'City': ['New York', 'Chicago', 'New York', 'Chicago'],
    'Sales': [200, 150, 300, 250]
})

# Create a pivot table showing the sum of Sales by City
pivot_table = df.pivot_table(values='Sales', index='City', aggfunc='sum')
print(pivot_table)
```
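A pivot table becomes more useful with a second dimension. As a sketch, the example below adds a hypothetical `Quarter` column and spreads it across the columns of the result.

```python
import pandas as pd

sales = pd.DataFrame({
    'City': ['New York', 'Chicago', 'New York', 'Chicago'],
    'Quarter': ['Q1', 'Q1', 'Q2', 'Q2'],
    'Sales': [200, 150, 300, 250],
})

# Rows are cities, columns are quarters, cells hold the summed sales
by_city_quarter = sales.pivot_table(
    values='Sales', index='City', columns='Quarter', aggfunc='sum'
)
print(by_city_quarter)
```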

### 14. **Exporting Data**
You can export a DataFrame to various formats, such as CSV, Excel, or JSON.

- **To CSV**:
```python
df.to_csv('output.csv', index=False)
```

- **To Excel**:
```python
df.to_excel('output.xlsx', index=False)
```
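JSON export is mentioned above but not shown. A minimal sketch: `to_json()` returns a string when no path is given, which makes it easy to inspect and to read back. The frame `df_out` is made up for the example.

```python
import io
import pandas as pd

df_out = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# orient='records' writes one JSON object per row
json_text = df_out.to_json(orient='records')
print(json_text)

# Read the same text back into a DataFrame
df_back = pd.read_json(io.StringIO(json_text), orient='records')
print(df_back)
```

Passing a file name as the first argument, as in `df_out.to_json('output.json', orient='records')`, writes to disk instead.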

### 15. **Advanced Operations**

#### **Apply Functions**
You can apply custom functions to columns or rows using the `apply()` function.

##### Example:
```python
# Apply a custom function to the 'Age' column
# (df must have an 'Age' column here; the sales DataFrame from section 13 does not)
df['Age_in_5_years'] = df['Age'].apply(lambda x: x + 5)
print(df)
```
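The example above applies a function to one column. To apply a function across rows, pass `axis=1`. The sketch below uses a made-up `df_people` and also shows the vectorized alternative, which is usually preferable for simple arithmetic.

```python
import pandas as pd

df_people = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# axis=1 passes each row to the function as a Series
df_people['Label'] = df_people.apply(
    lambda row: f"{row['Name']} ({row['Age']})", axis=1
)

# Simple arithmetic does not need apply(); operate on the whole column at once
df_people['Age_in_5_years'] = df_people['Age'] + 5

print(df_people)
```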

#### **Window Functions**
Pandas supports rolling window operations, which are useful for time series analysis.

##### Example:
```python
# Rolling mean over 2 periods; the first row is NaN because its window is incomplete
df['Rolling_mean'] = df['Sales'].rolling(window=2).mean()
print(df)
```
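Since rolling windows are mostly used on time series, here is a small sketch on a date-indexed Series. The dates and the name `sales_ts` are assumptions. `min_periods=1` is one way to avoid the leading NaN.

```python
import pandas as pd

sales_ts = pd.Series(
    [200, 150, 300, 250],
    index=pd.date_range('2024-01-01', periods=4, freq='D'),
)

# Mean of the current and previous day; the first day has no full window
rolling_mean = sales_ts.rolling(window=2).mean()

# Allow partial windows so the first value is kept
rolling_partial = sales_ts.rolling(window=2, min_periods=1).mean()

print(rolling_mean)
print(rolling_partial)
```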

---


```python
import pandas as pd
import numpy as np
```

```python
company_info = {
    'Name': ["Wasif", "Galib", "Hasib"],
    'Age': [23, 21, 26],
    'Salary': [120000, 45000, 34000]
}
df = pd.DataFrame(company_info)
print(df)
```

```text
    Name  Age  Salary
0  Wasif   23  120000
1  Galib   21   45000
2  Hasib   26   34000
```


```python
# 30 rows x 5 columns of random integers from 1 to 99, with row labels 1 to 30
nums = np.random.randint(1, 100, size=[30, 5])
ndf = pd.DataFrame(nums, columns=['A', 'B', 'C', 'D', 'E'], index=list(range(1, 31)))
print(ndf.head())
```

```text
    A   B   C   D   E
1  75  80  79  69  27
2  89   7  27  18  96
3  13  65  29  89  19
4  38  57  29  67  95
5  55  26  23  18  60
```


```python
# Show the column dtypes, non-null counts, number of entries, and memory usage
ndf.info()
```

```text
<class 'pandas.core.frame.DataFrame'>
Index: 30 entries, 1 to 30
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       30 non-null     int32
 1   B       30 non-null     int32
 2   C       30 non-null     int32
 3   D       30 non-null     int32
 4   E       30 non-null     int32
dtypes: int32(5)
memory usage: 840.0 bytes
```


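```python
# Summary statistics per column: count, mean, std, min, quartiles, and max
ndf.describe()
```

| | A | B | C | D | E |
|---|---|---|---|---|---|
| count | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 |
| mean | 46.2 | 50.066667 | 57.4 | 52.866667 | 50.666667 |
| std | 27.704599 | 27.615505 | 27.555086 | 30.599452 | 28.826871 |
| min | 2.0 | 7.0 | 7.0 | 3.0 | 2.0 |
| 25% | 23.5 | 30.0 | 30.0 | 18.0 | 25.5 |
| 50% | 45.5 | 48.0 | 62.5 | 60.0 | 51.5 |
| 75% | 67.75 | 72.0 | 81.0 | 80.0 | 70.75 |
| max | 98.0 | 98.0 | 97.0 | 98.0 | 96.0 |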


```python
ndf.shape  # (number of rows, number of columns)
```

```text
(30, 5)
```

```python
# Find the unique values in a column
ndf['A'].unique()
```

```text
array([75, 89, 13, 38, 55, 81, 64, 25, 48, 34, 28, 69, 43, 23,  2, 91, 52,
       54,  8, 15, 18, 53, 86, 17, 98, 16, 33])
```
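Alongside `unique()`, two related calls are often handy: `nunique()` for the number of distinct values and `value_counts()` for how often each one occurs. The Series `col` below is a made-up stand-in for a DataFrame column.

```python
import pandas as pd

col = pd.Series([75, 89, 13, 75, 13, 75])

distinct = col.unique()        # distinct values in order of first appearance
n_distinct = col.nunique()     # how many distinct values there are
counts = col.value_counts()    # occurrences per value, most frequent first

print(distinct)
print(n_distinct)
print(counts)
```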

```python
# Parquet is a compressed, columnar format that is usually smaller and faster to read than CSV.
# read_parquet needs a Parquet engine such as pyarrow or fastparquet to be installed.
res = pd.read_parquet("dataset/results.parquet")
res
```

| | year | type | discipline | event | as | athlete_id | noc | team | place | tied | medal |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1912.0 | Summer | Tennis | Singles, Men (Olympic) | Jean-François Blanchy | 1 | FRA | | 17.0 | True | |
| 1 | 1912.0 | Summer | Tennis | Doubles, Men (Olympic) | Jean-François Blanchy | 1 | FRA | Jean Montariol | | False | |
| 2 | 1920.0 | Summer | Tennis | Singles, Men (Olympic) | Jean-François Blanchy | 1 | FRA | | 32.0 | True | |
| 3 | 1920.0 | Summer | Tennis | Doubles, Mixed (Olympic) | Jean-François Blanchy | 1 | FRA | Jeanne Vaussard | 8.0 | True | |
| 4 | 1920.0 | Summer | Tennis | Doubles, Men (Olympic) | Jean-François Blanchy | 1 | FRA | Jacques Brugnon | 4.0 | False | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 308403 | 2022.0 | Winter | Luge | Singles, Men (Olympic) | Marián Skupek | 148983 | SVK | | 26.0 | False | |
| 308404 | 2022.0 | Winter | Alpine Skiing (Skiing) | Slalom, Women (Olympic) | Elsa Fermbäck | 148984 | SWE | | 28.0 | False | |
| 308405 | 2022.0 | Winter | Alpine Skiing (Skiing) | Team, Mixed (Olympic) | Hilma Lövblom | 148985 | SWE | Sweden | 13.0 | False | |
| 308406 | 2022.0 | Winter | Alpine Skiing (Skiing) | Giant Slalom, Women (Olympic) | Hilma Lövblom | 148985 | SWE | | | False | |


```python
res.head(3)  # head(n) returns the first n rows; tail(n) returns the last n rows
```

| | year | type | discipline | event | as | athlete_id | noc | team | place | tied | medal |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1912.0 | Summer | Tennis | Singles, Men (Olympic) | Jean-François Blanchy | 1 | FRA | | 17.0 | True | |
| 1 | 1912.0 | Summer | Tennis | Doubles, Men (Olympic) | Jean-François Blanchy | 1 | FRA | Jean Montariol | | False | |
| 2 | 1920.0 | Summer | Tennis | Singles, Men (Olympic) | Jean-François Blanchy | 1 | FRA | | 32.0 | True | |


```python
# sample(n) returns n randomly chosen rows
res.sample(5)
```

| | year | type | discipline | event | as | athlete_id | noc | team | place | tied | medal |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 93576 | 1968.0 | Summer | Shooting | Rapid-Fire Pistol, 25 metres, Open (Olympic) | Virgil Atanasiu | 43867 | ROU | | 21.0 | False | |
| 159883 | 1900.0 | Summer | Athletics | Triple Jump, Men (Olympic) | Karl Gustaf Staaf | 76397 | SWE | | | False | |
| 221638 | 2006.0 | Winter | Ice Hockey (Ice Hockey) | Ice Hockey, Women (Olympic) | Cherie Piper | 102080 | CAN | Canada | 1.0 | False | Gold |
| 26209 | 1976.0 | Summer | Cycling Road (Cycling) | Road Race, Individual, Men (Olympic) | Hamblin González | 14532 | NCA | | | False | |
| 94508 | 1952.0 | Summer | Shooting | Small-Bore Rifle, Three Positions, 50 metres, ... | Walther Fröstell | 44183 | SWE | | 14.0 | False | |
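Because `sample()` is random, the rows change on every run. A minimal sketch of making the draw repeatable with `random_state`, and of sampling a fraction instead of a fixed count, is shown below. The frame `df_demo` is a small stand-in, since the Parquet file is not available here.

```python
import pandas as pd

df_demo = pd.DataFrame({'x': range(10)})

# The same random_state always selects the same rows
first_draw = df_demo.sample(3, random_state=42)
second_draw = df_demo.sample(3, random_state=42)
print(first_draw.equals(second_draw))

# frac samples a share of the rows instead of a fixed number
half = df_demo.sample(frac=0.5, random_state=0)
print(len(half))
```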
