## Pandas - Complete Short Notes

### Overview
- **Pandas**: An open-source data manipulation and analysis library for Python.
- Built on top of **NumPy**.
- Provides two main data structures: **Series** and **DataFrame**.

### Installation
```sh
pip install pandas
```

### Key Data Structures
1. **Series**
   - 1D labeled array.
   - Creation: `pd.Series(data, index)`

2. **DataFrame**
   - 2D labeled data structure.
   - Creation: `pd.DataFrame(data, index=index, columns=columns)` (positional order is `data, index, columns`, so keywords are safer)
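
A minimal sketch of both structures, using made-up sample values; the labels `a`, `b`, `c`, `r1`, and `r2` are illustrative only.

```python
import pandas as pd

# Series: one-dimensional values with an explicit index of labels
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# DataFrame: a dict of columns plus an explicit row index
df = pd.DataFrame(
    {"name": ["John", "Anna"], "age": [28, 24]},
    index=["r1", "r2"],
)

print(s["b"])            # label-based lookup on a Series
print(df.shape)          # (rows, columns)
print(type(df["age"]))   # selecting one column yields a Series
```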

### Data Loading
- **CSV**: `pd.read_csv('file.csv')`
- **Excel**: `pd.read_excel('file.xlsx')`
- **SQL**: `pd.read_sql('SELECT * FROM table', connection)`
- **JSON**: `pd.read_json('file.json')`
- **Clipboard**: `pd.read_clipboard()`
- **Parquet**: `pd.read_parquet('file.parquet')`
- **HTML**: `pd.read_html('file.html')` (returns a list of DataFrames, one per table found)
- **Feather**: `pd.read_feather('file.feather')`
- **ORC**: `pd.read_orc('file.orc')`
- **STATA**: `pd.read_stata('file.dta')`
- **SAS**: `pd.read_sas('file.sas7bdat')`
- **SPSS**: `pd.read_spss('file.sav')`
- **HDF5**: `pd.read_hdf('file.h5', key='df')`
- **Pickle**: `pd.read_pickle('file.pkl')`
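
Most readers accept options for types and date parsing. The sketch below reads CSV text from an in-memory buffer so that it runs without any file on disk; the column names and values are invented for illustration.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file (hypothetical sample data)
csv_text = "id,city,joined\n1,Oslo,2021-01-05\n2,Lima,2021-02-10\n"

df = pd.read_csv(
    io.StringIO(csv_text),     # a file path or URL works here too
    dtype={"id": "int64"},     # force a column type
    parse_dates=["joined"],    # parse this column as datetime
)

print(df.dtypes)
```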

### Data Exporting
- **CSV**: `df.to_csv('file.csv', index=False)`
- **Excel**: `df.to_excel('file.xlsx', index=False)`
- **SQL**: `df.to_sql('table', connection)`
- **JSON**: `df.to_json('file.json')`
- **Clipboard**: `df.to_clipboard()`
- **Parquet**: `df.to_parquet('file.parquet')`
- **HTML**: `df.to_html('file.html')`
- **Feather**: `df.to_feather('file.feather')`
- **ORC**: `df.to_orc('file.orc')`
- **STATA**: `df.to_stata('file.dta')`
- **Pickle**: `df.to_pickle('file.pkl')`
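
As a quick check that an export round-trips, the following sketch writes a small sample frame to CSV text and reads it back. Calling `to_csv` without a path returns the CSV as a string, which keeps the example free of temporary files.

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["John", "Anna"], "age": [28, 24]})

# No path given, so to_csv returns the CSV content as a string
csv_text = df.to_csv(index=False)

# Read the text back to confirm the data survived the round trip
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```

Passing `index=False` avoids writing the row index as an extra unnamed column.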

### Data Inspection
- **Top rows**: `df.head(n)`
- **Bottom rows**: `df.tail(n)`
- **Shape**: `df.shape`
- **Info**: `df.info()`
- **Summary statistics**: `df.describe()`
- **Columns**: `df.columns`
- **Index**: `df.index`
- **Values**: `df.to_numpy()` (preferred over the older `df.values`)
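
The inspection calls above can be tried on a small sample frame, as in this sketch (the data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"name": ["John", "Anna", "Peter"], "age": [28, 24, 35]})

first_two = df.head(2)      # first two rows
n_rows, n_cols = df.shape   # shape is an attribute, not a method
stats = df.describe()       # by default summarizes numeric columns only
arr = df.to_numpy()         # underlying data as a NumPy array

df.info()                   # prints column dtypes and non-null counts
print(stats.loc["mean", "age"])
```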

### Selecting Data
- **Column selection**: `df['column']`, `df[['col1', 'col2']]`
- **Row selection**:
  - By label: `df.loc['label']`
  - By position: `df.iloc[pos]`
  - Single value (scalar): by label `df.at['label', 'column']`, by position `df.iat[pos, col_index]`
- **Conditional selection**: `df[df['column'] > value]`
- **Setting index**: `df.set_index('column', inplace=True)`
- **Resetting index**: `df.reset_index(drop=True, inplace=True)`
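
The difference between the selectors is easiest to see side by side. This is a small sketch with invented names and values:

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [28, 24, 35], "city": ["Oslo", "Lima", "Rome"]},
    index=["john", "anna", "peter"],
)

row_by_label = df.loc["anna"]            # row as a Series, looked up by index label
row_by_pos = df.iloc[0]                  # row as a Series, looked up by integer position
cell_by_label = df.at["peter", "age"]    # single scalar by label
cell_by_pos = df.iat[0, 1]               # single scalar by position
over_25 = df[df["age"] > 25]             # rows where the boolean mask is True

print(over_25)
```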

### Data Manipulation
- **Add column**: `df['new_col'] = data`
- **Drop column**: `df = df.drop(columns=['col1'])` (returns a new DataFrame)
- **Rename columns**: `df = df.rename(columns={'old_name': 'new_name'})` (returns a new DataFrame)
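- **Apply function**: `df['col'].apply(func)` per value of a column; `df.map(func)` element-wise over the whole frame (named `df.applymap(func)` before pandas 2.1, where it was deprecated)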
- **Map values**: `df['col'].map(mapping_dict)`
- **Replace values**: `df.replace(to_replace, value)`
- **String operations**: `df['col'].str.upper()`, `df['col'].str.contains('pattern')`
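
A short sketch combining several of these operations on sample data. The grade-to-points mapping is a made-up example.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["john", "anna"], "grade": ["a", "b"], "score": [81, 67]}
)

df["name_upper"] = df["name"].str.upper()                 # vectorized string method
df["points"] = df["grade"].map({"a": 4, "b": 3})          # dictionary lookup per value
df["score_pct"] = df["score"].apply(lambda x: x / 100)    # arbitrary function per value

# rename and drop return new frames, so reassign the result
df = df.rename(columns={"score": "raw_score"}).drop(columns=["grade"])

print(df)
```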

### Data Cleaning
- **Missing values**:
  - Detect: `df.isna()`, `df.notna()`
  - Drop: `df.dropna(subset=['col'])`
  - Fill: `df.fillna(value)`
  - Interpolate: `df.interpolate(method='linear')`
- **Duplicates**:
  - Detect: `df.duplicated()`
  - Drop: `df.drop_duplicates()`
- **Data types**: `df.dtypes`, `df.astype(new_type)`
- **Replace values**: `df.replace(to_replace, value)`
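
One possible cleaning sequence, sketched on sample data: count missing values, drop duplicate rows, fill the remaining gaps with the column mean, then tighten a dtype. Filling with the mean is just one choice among many and is an assumption of this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"id": [1, 2, 2, 3], "score": [10.0, np.nan, np.nan, 30.0]}
)

missing_per_col = df.isna().sum()      # number of missing values in each column
deduped = df.drop_duplicates()         # the two identical (2, NaN) rows collapse to one

# Fill remaining gaps in "score" with the mean of the known values
filled = deduped.fillna({"score": deduped["score"].mean()})
filled = filled.astype({"id": "int32"})

print(filled)
```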

### Data Transformation
- **Sorting**: `df.sort_values(by='col', ascending=False)`
- **Pivoting**:
  - Pivot: `df.pivot(index='col1', columns='col2', values='col3')`
  - Melt: `df.melt(id_vars=['col1'], value_vars=['col2'])`
- **Grouping**:
  - Group by: `df.groupby('col').agg({'col2': 'mean'})`
  - Apply: `df.groupby('col').apply(func)`
  - Transform: `df.groupby('col')['col2'].transform(func)`
- **Concatenation**: `pd.concat([df1, df2], axis=0)`
- **Merging**: `pd.merge(df1, df2, on='key')`
- **Joining**: `df1.join(df2.set_index('key'), on='key')`
- **Stack/Unstack**: `df.stack()`, `df.unstack()`
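
The sketch below applies grouping, a group-wise transform, a pivot, and a merge to two small frames. The `sales` and `regions` tables are hypothetical.

```python
import pandas as pd

sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 150, 80, 120],
})
regions = pd.DataFrame({"store": ["A", "B"], "region": ["North", "South"]})

# Aggregate: one row per group
totals = sales.groupby("store")["revenue"].sum()

# Transform: result has the same length as the input, aligned row by row
sales["store_share"] = (
    sales["revenue"] / sales.groupby("store")["revenue"].transform("sum")
)

# Pivot long data to wide: one row per store, one column per quarter
wide = sales.pivot(index="store", columns="quarter", values="revenue")

# Left merge keeps every row of sales and adds the matching region
merged = pd.merge(sales, regions, on="store", how="left")

print(wide)
```

`agg` and `sum` reduce each group to one row, while `transform` broadcasts the group result back to every original row, which is why it can be assigned directly as a new column.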

### Statistical Functions
- **Aggregations**: `df.sum()`, `df.mean()`, `df.median()`, `df.std()`, `df.min()`, `df.max()`, `df.count()`, `df.var()`, `df.sem()`, `df.describe()`
- **Cumulative**: `df.cumsum()`, `df.cumprod()`
- **Rolling**: `df.rolling(window=n).mean()`
- **Expanding**: `df.expanding().mean()`
- **Quantile**: `df.quantile(q=0.5)`
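- **Correlation**: `df.corr(numeric_only=True)` (in pandas 2.0+, non-numeric columns raise an error unless excluded)
- **Covariance**: `df.cov(numeric_only=True)`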
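
To illustrate how rolling, expanding, and cumulative calculations differ, here is a small sketch on a five-element sample Series:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], dtype="float64")

rolling_mean = s.rolling(window=3).mean()   # fixed window of 3; first two results are NaN
expanding_mean = s.expanding().mean()       # window grows from the start to each row
running_total = s.cumsum()                  # cumulative sum
median = s.quantile(0.5)                    # the 0.5 quantile is the median

print(rolling_mean)
```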

### Time Series
- **Datetime conversion**: `pd.to_datetime(df['col'])`
- **Setting datetime index**: `df.set_index(pd.to_datetime(df['date_col']))`
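- **Resampling**: `df.resample('M').mean()` (month-end; pandas 2.2+ spells this alias `'ME'`)
- **Time-based indexing**: `df.loc['2021']`, `df.loc['2021-01':'2021-06']` (requires a datetime index; row lookup with `df['2021']` was removed in pandas 2.0)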
- **Rolling window**: `df.rolling(window=3).sum()`
- **Shifting data**: `df.shift(periods=1)`
- **Lag/Lead**: `df['col'].shift(1)`, `df['col'].shift(-1)`
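
A minimal time series sketch on six days of sample data. It uses a two-day resampling frequency rather than a monthly one, simply to keep the example small.

```python
import pandas as pd

idx = pd.date_range("2021-01-01", periods=6, freq="D")
ts = pd.DataFrame({"value": [1, 2, 3, 4, 5, 6]}, index=idx)

two_day = ts.resample("2D").sum()                  # aggregate into 2-day bins
first_days = ts.loc["2021-01-01":"2021-01-03"]     # label slice, both ends included

ts["prev"] = ts["value"].shift(1)                  # lag by one row; first row becomes NaN
ts["change"] = ts["value"] - ts["prev"]            # day-over-day difference

print(two_day)
```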

### Visualization
- **Plotting**: `df.plot()`, `df['col'].plot(kind='bar')`, `df.plot(kind='scatter', x='col1', y='col2')`
- **Histograms**: `df['col'].hist()`
- **Box plots**: `df.boxplot(column='col', by='group')`
- **Pie chart**: `df['col'].value_counts().plot.pie()`

### Advanced Topics
- **Categorical Data**: `df['col'] = df['col'].astype('category')`
- **Sparse Data**: `df.astype(pd.SparseDtype('float', fill_value=0))` (`pd.SparseDataFrame` was removed in pandas 1.0)
- **MultiIndex**: `df.set_index(['col1', 'col2'])`
- **Window Operations**: `df.rolling(window=3).sum()`
- **Expanding**: `df.expanding(min_periods=1).mean()`
- **EWM (Exponentially Weighted)**: `df.ewm(span=3).mean()`
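
The following sketch touches each of these features once, on a tiny sample frame. The region codes and sales figures are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["N", "N", "S", "S"],
    "year": [2020, 2021, 2020, 2021],
    "sales": [10.0, 12.0, 7.0, 9.0],
})

# MultiIndex: two columns become a hierarchical row index
indexed = df.set_index(["region", "year"])
n_2021 = indexed.loc[("N", 2021), "sales"]     # look up one cell by a full key tuple

# Categorical: stores repeated labels as integer codes plus a category list
df["region"] = df["region"].astype("category")

# Exponentially weighted mean: recent rows weigh more than older ones
smoothed = df["sales"].ewm(span=3).mean()

# Sparse: only values different from fill_value are stored
sparse = pd.Series([0.0, 0.0, 1.5, 0.0]).astype(
    pd.SparseDtype("float64", fill_value=0.0)
)

print(n_2021, sparse.sparse.density)
```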

### Resources
- **Documentation**: [Pandas Documentation](https://pandas.pydata.org/docs/)
- **Cheat Sheet**: [Pandas Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
- **Tutorials**: [Pandas Tutorials on DataCamp](https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python)

### Example Code
#### Creating DataFrame and Basic Operations
```python
import pandas as pd

# Create DataFrame
data = {'Name': ['John', 'Anna', 'Peter'], 'Age': [28, 24, 35]}
df = pd.DataFrame(data)

# View DataFrame
print(df.head())

# Basic statistics
print(df['Age'].mean())

# Add a column
df['Gender'] = ['Male', 'Female', 'Male']

# Group by and aggregate (select the numeric column; 'Name' cannot be averaged)
grouped_df = df.groupby('Gender')['Age'].mean()

# Handle missing values
df.fillna(0, inplace=True)

# Export to CSV
df.to_csv('output.csv', index=False)
```

### Best Practices
- **Use appropriate data types**: Convert columns to appropriate types (e.g., `category` for categorical data).
- **Handle missing data early**: Use `dropna()` and `fillna()` methods.
- **Vectorized operations**: Prefer vectorized operations over loops for better performance.
- **Document your code**: Add comments and docstrings for better readability.
- **Chain operations**: Use method chaining for cleaner code (e.g., `df.dropna().groupby('col').sum()`).
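
To make the vectorization and chaining advice concrete, here is a small sketch on sample data. The loop version is included only for comparison.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"group": ["a", "b", "a", "b"], "x": [1.0, 2.0, np.nan, 4.0]}
)

# Loop version: works, but runs in Python one element at a time
doubled_loop = [v * 2 for v in df["x"]]

# Vectorized version: one operation over the whole column
doubled_vec = df["x"] * 2

# Method chaining: each step returns a new object, so steps read top to bottom
summary = (
    df.dropna(subset=["x"])
      .groupby("group")["x"]
      .sum()
)

print(summary)
```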

For more details, refer to the official [pandas documentation](https://pandas.pydata.org/docs/).