# Pandas Tutorial 1: What is Pandas

In the last tutorial, we saw how hard it is to extract insights from data without a tool like Pandas.

Pandas is a powerful Python library that simplifies loading, cleaning, and analyzing data, which has made it a standard tool in data science.

#### Topics covered:
* **Intro to Data Science and Data Analytics**
* **What is Pandas?**
* **Basic Pandas Functions**

This tutorial lays the groundwork for using Pandas to tackle real-world data challenges efficiently.

### What is Data Science?

Data Science involves extracting insights from large datasets using methods like:
* data collection
* cleaning
* analysis
* visualization
* interpretation

It helps answer questions or solve problems based on the data.

### What is Pandas?

Pandas is a Python library for data manipulation and analysis. It offers structures like **DataFrames** and **Series** that make handling structured data easier. Pandas is essential in data science for cleaning, transforming, and exploring data efficiently.
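As a minimal sketch of the two structures, the example below builds a small DataFrame and pulls one column out as a Series. The column names and values are made up for illustration and are not taken from the tutorial's dataset.

```python
import pandas as pd

# Hypothetical sample data, not taken from nyc_weather.csv
weather = pd.DataFrame({
    "day": ["Mon", "Tue", "Wed"],
    "temp_f": [38, 36, 40],
})

# Selecting a single column returns a Series
temps = weather["temp_f"]

print(type(weather).__name__)  # DataFrame
print(type(temps).__name__)    # Series
print(weather.shape)           # (3, 2): three rows, two columns
```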

### Pandas Features
* **DataFrames**: Two-dimensional labeled data structures, like tables or spreadsheets.
* **Series**: One-dimensional labeled arrays, similar to a column in a DataFrame.
* **Built-in Functions**: 
     - **`groupby`** - Groups data by columns for aggregation.
     - **`merge`** - Combines DataFrames based on common columns, like SQL joins.
     - **`concat`** - Concatenates DataFrames or Series along a specific axis.
     - **`pivot`** - Reshapes data by restructuring tables based on column values. 
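The four functions listed above can be tried on a toy dataset. This is only a sketch: the cities, states, and temperatures are invented for illustration, and real data will usually need more care (for example, `pivot` raises an error if an index/column pair appears more than once).

```python
import pandas as pd

# Hypothetical readings; all names and values are made up for illustration
readings = pd.DataFrame({
    "city": ["NYC", "NYC", "Boston", "Boston"],
    "day": [1, 2, 1, 2],
    "temp_f": [38, 36, 30, 34],
})
regions = pd.DataFrame({"city": ["NYC", "Boston"], "state": ["NY", "MA"]})

# groupby: average temperature per city
avg_by_city = readings.groupby("city")["temp_f"].mean()

# merge: attach the state to each reading via the shared "city" column
with_state = readings.merge(regions, on="city", how="left")

# concat: stack two DataFrames with the same columns on top of each other
extra = pd.DataFrame({"city": ["NYC"], "day": [3], "temp_f": [40]})
stacked = pd.concat([readings, extra], ignore_index=True)

# pivot: one row per day, one column per city
wide = readings.pivot(index="day", columns="city", values="temp_f")

print(avg_by_city)
print(wide)
```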

**Explore the following using Pandas** (load the data from the CSV file `nyc_weather.csv`):
- **Maximum Temperature:** Find the highest temperature recorded in New York during January.
- **Rainy Days:** Identify the days it rained.
- **Average Wind Speed:** Calculate the average wind speed for the month.

To analyze this data, load the dataset into a Pandas DataFrame and use the relevant Pandas functions.
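As a rough sketch of the loading step, the example below reads a few rows laid out like the tutorial's data (a subset of the columns) from an in-memory string, so it runs without the file. In practice you would pass the path to `nyc_weather.csv` instead. The `parse_dates` argument is an optional extra, not something the tutorial itself uses.

```python
import io
import pandas as pd

# A few rows inlined for illustration; replace io.StringIO(csv_text)
# with the path to nyc_weather.csv when working with the real file
csv_text = """EST,Temperature,WindSpeedMPH,Events
1/1/2016,38,8.0,
1/9/2016,44,8.0,Rain
1/10/2016,50,,Rain
"""

# Empty fields become NaN; parse_dates converts EST to datetime values
df_sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["EST"])

print(df_sample.dtypes)
print(df_sample.isna().sum())  # count of missing values per column
```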

## Data Munging - `fillna()`

Data munging, or data wrangling, involves cleaning and preparing messy data for analysis. This step is crucial for ensuring data accuracy. One common technique is **filling missing values**, which can be done with the `fillna()` method:

```python
df.fillna(0, inplace=True) # Replace missing values with `0`
```

This code replaces every missing value (`NaN`) in the DataFrame with `0`, modifying `df` in place, so that later steps do not have to handle gaps.
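Filling every column with the same value is not always appropriate, since a `0` in a text column carries no meaning. One possible refinement, sketched here on a made-up frame, is to pass `fillna()` a dictionary that maps column names to fill values. The `"No Event"` label is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in a numeric and a text column
df_demo = pd.DataFrame({
    "WindSpeedMPH": [8.0, np.nan, 5.0],
    "Events": ["Rain", np.nan, np.nan],
})

# Fill each column with a value that suits its meaning;
# without inplace=True, fillna returns a new DataFrame
filled = df_demo.fillna({"WindSpeedMPH": 0, "Events": "No Event"})

print(filled)
```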

```python
import pandas as pd
df = pd.read_csv('C:\\Users\\Vaishob\\PycharmProjects\\pandas\\nyc_weather.csv')
df
```

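| Unnamed: 0 | EST | Temperature | DewPoint | Humidity | Sea Level PressureIn | VisibilityMiles | WindSpeedMPH | PrecipitationIn | CloudCover | Events | WindDirDegrees |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1/1/2016 | 38 | 23 | 52 | 30.03 | 10 | 8.0 | 0 | 5 | NaN | 281 |
| 1 | 1/2/2016 | 36 | 18 | 46 | 30.02 | 10 | 7.0 | 0 | 3 | NaN | 275 |
| 2 | 1/3/2016 | 40 | 21 | 47 | 29.86 | 10 | 8.0 | 0 | 1 | NaN | 277 |
| 3 | 1/4/2016 | 25 | 9 | 44 | 30.05 | 10 | 9.0 | 0 | 3 | NaN | 345 |
| 4 | 1/5/2016 | 20 | -3 | 41 | 30.57 | 10 | 5.0 | 0 | 0 | NaN | 333 |
| 5 | 1/6/2016 | 33 | 4 | 35 | 30.5 | 10 | 4.0 | 0 | 0 | NaN | 259 |
| 6 | 1/7/2016 | 39 | 11 | 33 | 30.28 | 10 | 2.0 | 0 | 3 | NaN | 293 |
| 7 | 1/8/2016 | 39 | 29 | 64 | 30.2 | 10 | 4.0 | 0 | 8 | NaN | 79 |
| 8 | 1/9/2016 | 44 | 38 | 77 | 30.16 | 9 | 8.0 | T | 8 | Rain | 76 |
| 9 | 1/10/2016 | 50 | 46 | 71 | 29.59 | 4 | NaN | 1.8 | 7 | Rain | 109 |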


### Finding the Maximum Temperature - `max()`

To find the highest temperature recorded in New York during January, call `max()` on the `Temperature` column:

```python
# What is the max temp
df['Temperature'].max()
```

50

The maximum temperature in New York during January was **50 degrees**.
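`max()` returns the value but not the day it occurred. One way to recover the date, sketched here on three rows copied from the output shown earlier, is `idxmax()`, which returns the index label of the largest value. The `sample` frame and variable names are illustrative only.

```python
import pandas as pd

# Three rows copied from the tutorial's data, limited to two columns
sample = pd.DataFrame({
    "EST": ["1/8/2016", "1/9/2016", "1/10/2016"],
    "Temperature": [39, 44, 50],
})

# idxmax gives the row label of the maximum; loc then looks up the date
hottest_row = sample["Temperature"].idxmax()
hottest_day = sample.loc[hottest_row, "EST"]

print(hottest_day)  # 1/10/2016
```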

### Dates It Rained - `df[condition]`

To find the dates when it rained in New York during January, you can use **Boolean indexing**: 

```python
df[df['Events']=='Rain']['EST']
```

**Note:** You first apply the condition to filter rows, i.e. `df[df['Events']=='Rain']`, and then select the desired column with `['EST']`.

```python
# Dates on which it rained
df[df['Events']=='Rain']['EST']
```

```
8      1/9/2016
9     1/10/2016
15    1/16/2016
26    1/27/2016
Name: EST, dtype: object
```

We observe that it rained on **January 9th**, **10th**, **16th** and **27th**.
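The same filter can also be written in a single step with `.loc`, which takes the row condition and the column name together. This is an alternative sketch, not the form used in the tutorial, shown on three rows copied from the data above.

```python
import pandas as pd

# Three rows copied from the tutorial's data, limited to two columns
sample = pd.DataFrame({
    "EST": ["1/8/2016", "1/9/2016", "1/10/2016"],
    "Events": [None, "Rain", "Rain"],
})

# Row condition first, column label second
rainy_dates = sample.loc[sample["Events"] == "Rain", "EST"]

print(rainy_dates.tolist())  # ['1/9/2016', '1/10/2016']
```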

### Calculating Average Wind Speed - `mean()`

To calculate January's average wind speed:
* Fill missing values in the DataFrame.
* Compute the mean of the `WindSpeedMPH` column.

```python
# What is the avg wind speed
df.fillna(0, inplace=True)
df['WindSpeedMPH'].mean()
```

6.225806451612903

The average wind speed in New York during January was about **6.23 MPH**.
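Keep in mind that filling gaps with `0` counts each missing day as a calm day, which pulls the average down. By default, `mean()` skips missing values, so the two approaches can give different answers. The made-up Series below illustrates the difference; which one is appropriate depends on what a missing reading means in your data.

```python
import numpy as np
import pandas as pd

# Hypothetical wind speeds with one missing reading
wind = pd.Series([8.0, np.nan, 4.0])

# Default behaviour: NaN is skipped, so the mean is over two values
mean_skipping_missing = wind.mean()

# After filling with 0, the mean is over three values
mean_with_zero_fill = wind.fillna(0).mean()

print(mean_skipping_missing)  # 6.0
print(mean_with_zero_fill)    # 4.0
```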

## Data Science Process

Data science typically includes several key stages:
* **Data Collection:** Gathering raw data from various sources.
* **Data Cleaning:** Removing inconsistencies, filling missing values, and correcting errors. 
* **Data Exploration:** Identifying patterns, trends, and anomalies in the data. 
* **Data Modelling:** Applying statistical or machine learning models for predictions or decisions.
* **Data Visualization:** Creating charts to present insights.
* **Interpretation:** Drawing conclusions to inform data-driven decisions.
