# Pandas is a powerful Python library used for data analysis and manipulation.

## Here's a quick overview of Pandas:

### What it does:

* Loads and cleans datasets of various formats (CSV, Excel, SQL databases, etc.).

* Creates and manipulates data structures like DataFrames (two-dimensional labeled tables, similar to spreadsheets) and Series (one-dimensional labeled arrays).

* Performs data analysis tasks like filtering, sorting, grouping, aggregating, and statistical calculations.

* Enables quick data visualization through the built-in `plot()` methods, which use Matplotlib by default.
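As a rough illustration of the tasks listed above, the sketch below loads a small CSV from an in-memory string, then filters, sorts, and aggregates it. The data and the column names (`city`, `temp_c`) are made up for this example.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice a file path would be passed to pd.read_csv
csv_text = """city,temp_c
Oslo,4.0
Cairo,29.5
Oslo,6.0
Cairo,31.5
"""

weather = pd.read_csv(io.StringIO(csv_text))

warm = weather[weather["temp_c"] > 5]                    # filtering
ordered = weather.sort_values("temp_c")                  # sorting
mean_by_city = weather.groupby("city")["temp_c"].mean()  # grouping and aggregating

print(mean_by_city)
```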

### Why it's popular:

* Easy to learn: User-friendly syntax and extensive documentation make it accessible to users of all levels.

* Powerful and versatile: Handles a wide range of data types and analysis tasks.

* Integrates well with other libraries: Works seamlessly with popular scientific computing libraries like NumPy and SciPy.

* Open-source and community-driven: Continuously improving with active development and a helpful community.

### Who uses it:

* Data scientists, analysts, and researchers.

* Financial analysts and economists.

* Machine learning engineers and developers.

* Anyone who needs to work with and analyze data effectively.

## What are DataFrames?

DataFrames are the **backbone of Pandas**, serving as the primary data structure for holding and manipulating data. Think of them as **flexible, two-dimensional labeled tables** similar to spreadsheets, but with much more power and functionality.

Here's a closer look at DataFrames:

**Structure:**

* **Rows:** Represent individual records or observations.
* **Columns:** Represent variables or features within each record.
* **Cells:** Intersection of rows and columns, containing specific data points.

**Data Types:**

* Each column has a single data type (dtype), such as integers, floats, booleans, dates, or strings; different columns can have different dtypes.
* A column with the `object` dtype can hold arbitrary Python objects (mixed types, even other DataFrames), at the cost of speed and type safety.
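To make the per-column typing concrete, here is a minimal sketch that builds a small DataFrame and inspects its dtypes. The `people` table and its column names are invented for illustration.

```python
import pandas as pd

# Hypothetical data: each column gets its own dtype
people = pd.DataFrame({
    "name": ["Ada", "Linus"],     # strings
    "age": [36, 54],              # integers
    "height_m": [1.65, 1.80],     # floats
    "active": [True, False],      # booleans
})

# One dtype per column; the exact names can vary with platform and Pandas version
column_dtypes = people.dtypes
print(column_dtypes)
```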

**Key Features:**

* **Indexing and selection:** Access specific rows, columns, or cells using labels, positions, or logical conditions.
* **Operations:** Perform calculations, aggregations, filtering, and sorting on data within columns or rows.
* **Merging and joining:** Combine data from multiple DataFrames based on shared information.
* **Visualization:** Easily visualize data patterns and relationships through built-in plotting functions.
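The merging and aggregation features above can be sketched as follows. The `orders` and `customers` tables are hypothetical, and a left join is just one of several join types that `merge` supports.

```python
import pandas as pd

# Two hypothetical tables that share a "customer_id" column
orders = pd.DataFrame({"customer_id": [1, 2, 2], "amount": [10.0, 5.0, 7.5]})
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ada", "Linus"]})

# Keep every order and attach the matching customer name
merged = orders.merge(customers, on="customer_id", how="left")

# Aggregate the merged data: total amount per customer
total_by_name = merged.groupby("name")["amount"].sum()
print(total_by_name)
```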

**Benefits:**

* **Organized data representation:** Provides a clear and structured way to view and work with complex data sets.
* **Efficient data manipulation:** Offers powerful tools for cleaning, analyzing, and preparing data for further analysis.
* **Flexibility and versatility:** Adapts to various data types and analysis needs, making it a versatile tool for diverse tasks.

**In summary, DataFrames are the workhorses of Pandas.** They offer a user-friendly and powerful way to manage and analyze data, making them essential for anyone working with data science, analytics, or research.



In [66]:
# import pandas and NumPy

import pandas as pd
import numpy as np

In [67]:
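# create a 6x4 DataFrame holding the integers 0-23, with labeled rows and columns

df = pd.DataFrame(
    np.arange(0, 24).reshape(6, 4),
    # "Row5" appears twice; Pandas allows duplicate index labels
    index=["Row1", "Row2", "Row3", "Row4", "Row5", "Row5"],
    columns=["Column1", "Column2", "Column3", "Column4"],
)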

In [68]:
df.head()

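|      | Column1 | Column2 | Column3 | Column4 |
|------|---------|---------|---------|---------|
| Row1 | 0       | 1       | 2       | 3       |
| Row2 | 4       | 5       | 6       | 7       |
| Row3 | 8       | 9       | 10      | 11      |
| Row4 | 12      | 13      | 14      | 15      |
| Row5 | 16      | 17      | 18      | 19      |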


#### **NOTE ==>** head(): This is a built-in method of `pandas.DataFrame`. By default it returns the first five rows of the DataFrame; pass a number, e.g. `df.head(3)`, to change how many rows are returned.
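A small sketch of `head()` with an explicit row count, alongside its counterpart `tail()`. The DataFrame here (`numbers`) is a stand-in with the same values as the one above but a default integer index.

```python
import numpy as np
import pandas as pd

# Stand-in DataFrame: integers 0-23 in 6 rows and 4 columns
numbers = pd.DataFrame(
    np.arange(24).reshape(6, 4),
    columns=["Column1", "Column2", "Column3", "Column4"],
)

first_two = numbers.head(2)  # first two rows
last_two = numbers.tail(2)   # last two rows
print(first_two)
print(last_two)
```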

In [69]:
df.to_csv("test.csv")

#### **to_csv()** Built-in DataFrame method that saves the DataFrame as a CSV file. By default the index is written as the first column.
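The sketch below writes a small DataFrame to CSV and reads it back. It assumes a temporary directory is acceptable for the file. Reading with `index_col=0` restores the row labels; without it, the unnamed index column comes back as a regular column called `Unnamed: 0`.

```python
import os
import tempfile

import numpy as np
import pandas as pd

small = pd.DataFrame(
    np.arange(8).reshape(2, 4),
    index=["Row1", "Row2"],
    columns=["Column1", "Column2", "Column3", "Column4"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.csv")
    small.to_csv(path)                          # index is written as the first column
    restored = pd.read_csv(path, index_col=0)   # first column becomes the index again
    naive = pd.read_csv(path)                   # index column is read as ordinary data

print(restored)
print(naive.columns.tolist())
```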

## Accessing DataFrame Elements: Cheat Sheet

**Label-based:**

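* **Single element:** `df.loc['row', 'column']` (the chained form `df['column']['row']` also works for reading, but avoid it when assigning values)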
* **Column:** `df['column_name']`
* **Multiple columns:** `df[['col1', 'col2']]`

**Position-based (iloc):**

* **Row:** `df.iloc[row_index]`
* **Specific element:** `df.iloc[row_index, col_index]`
* **Subset:** `df.iloc[start_row:end_row, start_col:end_col]`

**Boolean Indexing:**

* **Conditionally select rows:** `df[df['column'] > value]`

**Tips:**

* Positions used with `iloc` start from 0, and the end of a slice is excluded.
* Use `df.head()` to understand structure.
* More advanced options in Pandas documentation.

**Bonus:**

* Use `df.loc` to select rows and columns by label: `df.loc['row', 'column']`.
* Use `df.at`/`df.iat` for faster scalar access.
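The access patterns in this cheat sheet can be sketched side by side. The `demo` DataFrame is an assumption made for this example: it holds the same values as the notebook's `df`, but with six unique row labels so that every label lookup returns a single value.

```python
import numpy as np
import pandas as pd

# Assumed example data with unique row labels Row1..Row6
demo = pd.DataFrame(
    np.arange(24).reshape(6, 4),
    index=["Row1", "Row2", "Row3", "Row4", "Row5", "Row6"],
    columns=["Column1", "Column2", "Column3", "Column4"],
)

by_label = demo.loc["Row2", "Column3"]   # label-based scalar access
by_position = demo.iloc[1, 2]            # position-based scalar access (same cell)
fast_label = demo.at["Row2", "Column3"]  # scalar-only, label-based
fast_position = demo.iat[1, 2]           # scalar-only, position-based

subset = demo.iloc[0:2, 0:2]             # first two rows and first two columns
large_rows = demo[demo["Column1"] > 8]   # boolean indexing on a column

print(by_label, by_position, fast_label, fast_position)
print(large_rows)
```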

In [70]:
df = pd.DataFrame(
    np.arange(0, 24).reshape(6, 4),
    index=["Row1", "Row2", "Row3", "Row4", "Row5", "Row5"],
    columns=["Column1", "Column2", "Column3", "Column4"],
)

In [71]:
df.head()

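|      | Column1 | Column2 | Column3 | Column4 |
|------|---------|---------|---------|---------|
| Row1 | 0       | 1       | 2       | 3       |
| Row2 | 4       | 5       | 6       | 7       |
| Row3 | 8       | 9       | 10      | 11      |
| Row4 | 12      | 13      | 14      | 15      |
| Row5 | 16      | 17      | 18      | 19      |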


In [72]:
df["Column1"]["Row1"]

0

In [73]:
df["Column1"]

Row1     0
Row2     4
Row3     8
Row4    12
Row5    16
Row5    20
Name: Column1, dtype: int32

In [74]:
df[["Column1", "Column2"]]

|      | Column1 | Column2 |
|------|---------|---------|
| Row1 | 0       | 1       |
| Row2 | 4       | 5       |
| Row3 | 8       | 9       |
| Row4 | 12      | 13      |
| Row5 | 16      | 17      |
| Row5 | 20      | 21      |


In [75]:
df.loc["Row1"]

Column1    0
Column2    1
Column3    2
Column4    3
Name: Row1, dtype: int32

In [76]:
type(df.loc["Row1"])

pandas.core.series.Series

## Data Series: Quick Guide

* **1. DataFrame column or row:**
    - Selecting a single column (`df['col']`) or a single row (`df.loc['row']`) returns a Series.
    - Think of it as a one-dimensional labeled array: the values of one variable (column) or one record (row).
    - Accessed and manipulated much like a DataFrame, but along a single dimension.
* **2. Independent data sequence:**
    - A Series can also be created on its own with `pd.Series(...)`, without being part of a DataFrame.
    - Examples: temperatures over time, stock prices.
    - Analyzed for trends, patterns, and relationships.

**Remember:**

- In Pandas, a Series is a one-dimensional labeled array: values plus an index.
- Outside Pandas, "data series" usually means any ordered sequence of data points, so consider the context.
- A Series taken from a DataFrame can be either one row or one column.
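A brief sketch of both situations: a Series created on its own, and Series obtained by selecting a column or a row from a DataFrame. The names `temps` and `frame` and their contents are invented for illustration.

```python
import pandas as pd

# A standalone Series: values plus an index of labels
temps = pd.Series([21.5, 23.0, 19.5], index=["Mon", "Tue", "Wed"], name="temp_c")

# A small DataFrame to select from
frame = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])

col = frame["a"]      # one column -> Series indexed by the row labels
row = frame.loc["x"]  # one row -> Series indexed by the column names

print(type(col).__name__, type(row).__name__)
print(temps["Tue"])
```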

In [77]:
df.iloc[0:2, 0:2]

|      | Column1 | Column2 |
|------|---------|---------|
| Row1 | 0       | 1       |
| Row2 | 4       | 5       |


In [78]:
type(df.iloc[0:2, 0:2])

pandas.core.frame.DataFrame

In [80]:
type(df.iloc[0:1, 0])

pandas.core.series.Series

## DataFrame to Array: Short Guide

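* **`to_numpy()`:** Converts the entire DataFrame to a NumPy array. It is not guaranteed to copy the data unless you pass `copy=True`.
* **`.values`:** Older attribute that returns a similar array; the Pandas documentation recommends `to_numpy()` instead.
* **Specific columns:** `df['col_name'].to_numpy()`.
* **Multiple columns:** `df.iloc[:, [0, 2]].to_numpy()` (precise selection).

**Tips:**

* Choose the method based on your needs (entire DataFrame or specific columns).
* Neither `to_numpy()` nor `.values` promises a copy or a view; pass `copy=True` to `to_numpy()` when you need an independent array.
* Each DataFrame column can have its own type, but a NumPy array has a single dtype, so mixed columns are converted to a common type (often `object`).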
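The conversion rules above can be checked with a short sketch. The `numeric` and `mixed` DataFrames are hypothetical examples; the point is the single dtype of the resulting array and the explicit `copy=True`.

```python
import pandas as pd

numeric = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
mixed = pd.DataFrame({"a": [1, 2], "label": ["x", "y"]})

arr = numeric.to_numpy()                 # 2-D array with one numeric dtype
arr_copy = numeric.to_numpy(copy=True)   # guaranteed independent copy
mixed_arr = mixed.to_numpy()             # numbers and strings fall back to object dtype
one_col = numeric["b"].to_numpy()        # 1-D array from a single column

# Changing the explicit copy leaves the DataFrame untouched
arr_copy[0, 0] = 99
print(numeric.loc[0, "a"])
print(mixed_arr.dtype)
```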

In [82]:
df.iloc[:, 1:].values

array([[ 1,  2,  3],
       [ 5,  6,  7],
       [ 9, 10, 11],
       [13, 14, 15],
       [17, 18, 19],
       [21, 22, 23]])

In [84]:
df.iloc[:, 1:].values.shape

(6, 3)

In [87]:
df.isnull()  # flag missing values: True wherever a cell is NaN/None

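|      | Column1 | Column2 | Column3 | Column4 |
|------|---------|---------|---------|---------|
| Row1 | False   | False   | False   | False   |
| Row2 | False   | False   | False   | False   |
| Row3 | False   | False   | False   | False   |
| Row4 | False   | False   | False   | False   |
| Row5 | False   | False   | False   | False   |
| Row5 | False   | False   | False   | False   |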


In [89]:
df.isnull().sum()  # count the missing values in each column

Column1    0
Column2    0
Column3    0
Column4    0
dtype: int64
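The DataFrame above has no missing values, so every count is zero. As a contrast, here is a minimal sketch with made-up data (`scores`) that does contain gaps, plus two common ways of handling them, `fillna` and `dropna`.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing entries
scores = pd.DataFrame({
    "math": [90.0, np.nan, 70.0],
    "art": [np.nan, np.nan, 85.0],
})

missing_per_column = scores.isnull().sum()  # number of NaN values per column
filled = scores.fillna(0)                   # replace NaN with 0
complete_rows = scores.dropna()             # keep only rows without any NaN

print(missing_per_column)
print(complete_rows)
```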

In [90]:
df["Column1"].value_counts()  # count how many times each distinct value occurs

Column1
0     1
4     1
8     1
12    1
16    1
20    1
Name: count, dtype: int64

In [91]:
df["Column1"].unique()  # return the distinct values of the column, in order of first appearance

array([ 0,  4,  8, 12, 16, 20])
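Because every value in `Column1` is distinct, each count above is 1. The difference between `value_counts()` and `unique()` is easier to see on data with repeats, as in this sketch (the `colors` Series is invented for the example; `nunique()` is added as the count of distinct values).

```python
import pandas as pd

# Hypothetical data with repeated values
colors = pd.Series(["red", "blue", "red", "green", "red"])

counts = colors.value_counts()   # occurrences per value, most frequent first
distinct = colors.unique()       # distinct values in order of first appearance
n_distinct = colors.nunique()    # number of distinct values

print(counts)
print(list(distinct), n_distinct)
```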