# Pandas Tutorial 1: What is Pandas

If you work in data science, you need to be familiar with the Python Pandas library. Pandas makes data science and analytics in Python both straightforward and powerful.

#### Topics covered:
* **Introduction to Data Science and Data Analytics**
* **What is Pandas?**
* **A Walkthrough of Basic Pandas Functionality**
* **Installing Pandas**

### What is Data Science?

Data Science, or data analytics, involves extracting insights from large and complex datasets through various analytical methods. To answer specific questions or solve problems related to the data, this process often includes:
* data collection
* cleaning
* analysis
* visualization
* interpretation

### What is Pandas?

Pandas is a powerful and versatile Python library specifically designed for data manipulation and analysis. It provides data structures like **DataFrames** and **Series**, making it easy to handle and analyze structured data. Pandas is widely used in data science for tasks such as data cleaning, transformation, and exploration, making the data analysis process more efficient and effective.

### Pandas Features
* **DataFrames**: Two-dimensional labeled data structures, similar to tables in a relational database or an Excel spreadsheet.
* **Series**: One-dimensional labeled arrays, which can be thought of as a single column in a DataFrame.
* **Built-in Functions**: Functions for data manipulation like:
     - **`groupby`**
     - **`merge`**
     - **`concat`**
     - **`pivot`**
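
The features listed above can be sketched on a tiny dataset. This is a minimal illustration, not part of the weather example: the cities, days, temperatures, and state codes below are hypothetical values invented for the demonstration.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
temps = pd.DataFrame({
    "city": ["New York", "New York", "Boston", "Boston"],
    "day": ["Mon", "Tue", "Mon", "Tue"],
    "temperature": [38, 36, 35, 33],
})

# A Series is a single labeled column of the DataFrame.
temperature = temps["temperature"]

# groupby: average temperature per city.
mean_by_city = temps.groupby("city")["temperature"].mean()

# merge: join two DataFrames on a shared key column, like a SQL join.
states = pd.DataFrame({"city": ["New York", "Boston"], "state": ["NY", "MA"]})
with_state = temps.merge(states, on="city")

# concat: stack DataFrames that share the same columns.
extra = pd.DataFrame({"city": ["Chicago"], "day": ["Mon"], "temperature": [30]})
combined = pd.concat([temps, extra], ignore_index=True)

# pivot: reshape so each day is a row and each city is a column.
wide = temps.pivot(index="day", columns="city", values="temperature")
```

Each call returns a new object and leaves `temps` unchanged, which makes it easy to try the operations one at a time.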

**Explore the following using Pandas** (Load data from the CSV file `nyc_weather`):
- **Maximum Temperature:** What was the highest temperature recorded in New York during January?
- **Rainy Days:** On which days did it rain?
- **Average Wind Speed:** What was the average wind speed throughout the month?

To work with this data, you would typically load the dataset into a Pandas DataFrame and use Pandas functions to analyze it.

**Example**:

```python
import pandas as pd

df = pd.read_csv('nyc_weather.csv')  # Load data from the CSV file

max_temp = df['Temperature'].max()  # Find the maximum temperature

rainy_days = df.loc[df['Events'] == 'Rain', 'EST']  # Find the dates on which it rained

average_wind_speed = df['WindSpeedMPH'].mean()  # Calculate the average wind speed (missing values are skipped)
```

## Data Munging and Data Wrangling

The process of cleaning, transforming, and preparing messy or unstructured data for analysis is known as **data munging** or **data wrangling**. This process is crucial to ensure the accuracy and quality of your analysis. One common technique in data munging is ***filling in missing values***, which can be done using the following code:

```python
df.fillna(0, inplace=True) # Replace missing values with `0`
```

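This command replaces all missing values in the DataFrame with zeros, so later operations do not encounter missing data. Zero is not always a sensible substitute (a missing wind speed is not the same as no wind), so choose the fill value with each column's meaning in mind.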
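
As a rough sketch of the options available when handling missing values, the example below compares three approaches on a small hypothetical DataFrame. The column names mirror the weather data, but the values and the `"No Event"` label are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing wind speed and one missing event.
df = pd.DataFrame({
    "WindSpeedMPH": [8.0, np.nan, 4.0],
    "Events": ["Rain", None, "Snow"],
})

# Count missing values per column before deciding how to handle them.
missing_counts = df.isna().sum()

# Option 1: fill missing values in a numeric column with 0.
wind_zero_filled = df["WindSpeedMPH"].fillna(0)

# Option 2: fill each column with a value that suits its meaning.
filled_per_column = df.fillna({
    "WindSpeedMPH": df["WindSpeedMPH"].mean(),  # mean of the known values
    "Events": "No Event",                       # hypothetical placeholder label
})

# Option 3: drop the rows that have no wind speed reading.
dropped = df.dropna(subset=["WindSpeedMPH"])
```

None of these calls use `inplace=True`, so the original `df` stays untouched and the results can be compared side by side.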

```python
import pandas as pd

df = pd.read_csv('C:\\Users\\Vaishob\\PycharmProjects\\pandas\\nyc_weather.csv')
df
```

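|   | EST | Temperature | DewPoint | Humidity | Sea Level PressureIn | VisibilityMiles | WindSpeedMPH | PrecipitationIn | CloudCover | Events | WindDirDegrees |
|---|-----|-------------|----------|----------|----------------------|-----------------|--------------|-----------------|------------|--------|----------------|
| 0 | 1/1/2016 | 38 | 23 | 52 | 30.03 | 10 | 8.0 | 0 | 5 | NaN | 281 |
| 1 | 1/2/2016 | 36 | 18 | 46 | 30.02 | 10 | 7.0 | 0 | 3 | NaN | 275 |
| 2 | 1/3/2016 | 40 | 21 | 47 | 29.86 | 10 | 8.0 | 0 | 1 | NaN | 277 |
| 3 | 1/4/2016 | 25 | 9 | 44 | 30.05 | 10 | 9.0 | 0 | 3 | NaN | 345 |
| 4 | 1/5/2016 | 20 | -3 | 41 | 30.57 | 10 | 5.0 | 0 | 0 | NaN | 333 |
| 5 | 1/6/2016 | 33 | 4 | 35 | 30.5 | 10 | 4.0 | 0 | 0 | NaN | 259 |
| 6 | 1/7/2016 | 39 | 11 | 33 | 30.28 | 10 | 2.0 | 0 | 3 | NaN | 293 |
| 7 | 1/8/2016 | 39 | 29 | 64 | 30.2 | 10 | 4.0 | 0 | 8 | NaN | 79 |
| 8 | 1/9/2016 | 44 | 38 | 77 | 30.16 | 9 | 8.0 | T | 8 | Rain | 76 |
| 9 | 1/10/2016 | 50 | 46 | 71 | 29.59 | 4 | NaN | 1.8 | 7 | Rain | 109 |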
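
To try the loading step without the file on disk, the sketch below reads a few rows from an in-memory string instead. The three rows and six columns are copied from the output above; the variable names `csv_text` and `sample` are illustrative, and the real file contains more rows and columns.

```python
import io

import pandas as pd

# Three rows and a subset of columns copied from the output above.
csv_text = """EST,Temperature,DewPoint,Humidity,WindSpeedMPH,Events
1/8/2016,39,29,64,4.0,
1/9/2016,44,38,77,8.0,Rain
1/10/2016,50,46,71,,Rain
"""

# read_csv accepts a file path or a file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head())        # first rows of the DataFrame
print(sample.shape)         # (number of rows, number of columns)
print(sample.dtypes)        # data type inferred for each column
print(sample.isna().sum())  # number of missing values per column
```

Empty fields in the CSV text become missing values in the DataFrame, which is why the last line reports one missing entry each for `WindSpeedMPH` and `Events`.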


### Finding the Maximum Temperature

To determine the maximum temperature recorded in New York during January, you can use the **`max()`** method:

```python
# What is the maximum temperature?
df['Temperature'].max()
```

```
50
```

This indicates that the maximum temperature in New York during the month of January was **50 degrees**.
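
Knowing the maximum value is often only half the answer; you may also want to know when it occurred. A minimal sketch, assuming a three-row subset of the table shown earlier, uses `idxmax()` to locate the row that holds the maximum:

```python
import pandas as pd

# A few rows copied from the table shown earlier (a subset, not the full month).
df = pd.DataFrame({
    "EST": ["1/8/2016", "1/9/2016", "1/10/2016"],
    "Temperature": [39, 44, 50],
})

max_temp = df["Temperature"].max()                # highest value in the column
hottest_row = df.loc[df["Temperature"].idxmax()]  # full row holding that value

print(max_temp)            # 50
print(hottest_row["EST"])  # 1/10/2016
```

`idxmax()` returns the index label of the first occurrence of the maximum, and `df.loc[...]` then retrieves the whole row for that label.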

### Dates on Which It Rained

To find out the dates when it rained in New York during January, you can query the DataFrame like this:
```python
df[column_name][condition]  # select a column, then keep only the rows where the condition is True
```

```python
# Dates on which it rained
df['EST'][df['Events'] == 'Rain']
```

```
8      1/9/2016
9     1/10/2016
15    1/16/2016
26    1/27/2016
Name: EST, dtype: object
```

This indicates that it rained on **January 9th**, **10th**, **16th** and **27th**.
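
The same filter can be written in two explicit steps, which makes the boolean condition easier to inspect. This sketch assumes a three-row subset of the data; `is_rain` and `rainy_dates` are illustrative names.

```python
import pandas as pd

# A few rows copied from the table shown earlier (a subset, not the full month).
df = pd.DataFrame({
    "EST": ["1/8/2016", "1/9/2016", "1/10/2016"],
    "Events": [None, "Rain", "Rain"],
})

is_rain = df["Events"] == "Rain"      # boolean Series, one True/False per row
rainy_dates = df.loc[is_rain, "EST"]  # rows where the mask is True, EST column only

print(is_rain.tolist())      # [False, True, True]
print(rainy_dates.tolist())  # ['1/9/2016', '1/10/2016']
```

Using `df.loc[mask, column]` selects rows and column in a single step, which is a common alternative to chaining two sets of square brackets.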

### Calculating the Average Wind Speed

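To calculate the average wind speed during January, you can use the following steps:
- First, fill any missing values in the DataFrame with `0`, so that days without a reading count as 0 MPH. Note that `mean()` skips missing values by default, so filling with `0` lowers the average compared with leaving those days out.
- Then, compute the mean of the **`WindSpeedMPH`** column: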

```python
# What is the average wind speed?
df.fillna(0, inplace=True)
df['WindSpeedMPH'].mean()
```

```
6.225806451612903
```

This indicates that the average wind speed in New York during January was approximately **6.23 MPH**.
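
How missing values are treated changes the average. The sketch below compares the two behaviors on three wind speed readings taken from the table shown earlier, one of which is missing; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Three readings from the table shown earlier; the last one is missing.
wind = pd.Series([4.0, 8.0, np.nan], name="WindSpeedMPH")

mean_skipping_missing = wind.mean()          # NaN ignored: (4 + 8) / 2
mean_with_zero_fill = wind.fillna(0).mean()  # NaN counted as 0: (4 + 8 + 0) / 3

print(mean_skipping_missing)  # 6.0
print(mean_with_zero_fill)    # 4.0
```

Which result is appropriate depends on what a missing reading means in your data: an unrecorded value is usually better skipped, while a value that truly represents zero can be filled.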

## Additional Details

### Data Science Process
In practice, data science involves multiple stages, including:
* **Data Collection**: Gathering raw data from various sources.
* **Data Cleaning**: Removing inconsistencies, missing values, and errors.
* **Data Exploration**: Identifying patterns, trends, and anomalies.
* **Data Modeling**: Applying statistical or machine learning models to make predictions or decisions.
* **Data Visualization**: Creating graphs and charts to communicate insights.
* **Interpretation**: Drawing conclusions and making data-driven decisions.
