# Pandas is a powerful Python library used for data analysis and manipulation.

## Here's a quick overview of Pandas:

### What it does:

* Loads and cleans datasets of various formats (CSV, Excel, SQL databases, etc.).

* Creates and manipulates data structures like DataFrames (similar to spreadsheets) and Series (one-dimensional labeled arrays).

* Performs data analysis tasks like filtering, sorting, grouping, aggregating, and statistical calculations.

* Enables data visualization through built-in plotting functions and integration with other libraries like Matplotlib.

### Why it's popular:

* Easy to learn: User-friendly syntax and extensive documentation make it accessible to users of all levels.

* Powerful and versatile: Handles a wide range of data types and analysis tasks.

* Integrates well with other libraries: Works seamlessly with popular scientific computing libraries like NumPy and SciPy.

* Open-source and community-driven: Continuously improving with active development and a helpful community.

### Who uses it:

* Data scientists, analysts, and researchers.

* Financial analysts and economists.

* Machine learning engineers and developers.

* Anyone who needs to work with and analyze data effectively.

## What are DataFrames?

DataFrames are the **backbone of Pandas**, serving as the primary data structure for holding and manipulating data. Think of them as **flexible, two-dimensional tables** with labeled rows and columns, similar to spreadsheets but with much more power and functionality.

Here's a closer look at DataFrames:

**Structure:**

* **Rows:** Represent individual records or observations.
* **Columns:** Represent variables or features within each record.
* **Cells:** Intersection of rows and columns, containing specific data points.

**Data Types:**

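* Can hold various data types, such as numbers, strings, booleans, and dates; a column of `object` dtype can store arbitrary Python objects, including lists or even other DataFrames.
* Each column has a single dtype, but different columns can have different dtypes. Mixing types within one column is possible, though the column then falls back to the slower, generic `object` dtype.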

**Key Features:**

* **Indexing and selection:** Access specific rows, columns, or cells using labels, positions, or logical conditions.
* **Operations:** Perform calculations, aggregations, filtering, and sorting on data within columns or rows.
* **Merging and joining:** Combine data from multiple DataFrames based on shared information.
* **Visualization:** Easily visualize data patterns and relationships through built-in plotting functions.

**Benefits:**

* **Organized data representation:** Provides a clear and structured way to view and work with complex data sets.
* **Efficient data manipulation:** Offers powerful tools for cleaning, analyzing, and preparing data for further analysis.
* **Flexibility and versatility:** Adapts to various data types and analysis needs, making it a versatile tool for diverse tasks.

**In summary, DataFrames are the workhorses of Pandas.** They offer a user-friendly and powerful way to manage and analyze data, making them essential for anyone working with data science, analytics, or research.
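The structure described above can be sketched with a tiny table. The names and values below are made up for illustration; they are not part of the notebook's data.

```python
import pandas as pd

# Hypothetical sample data: each dict key becomes a column,
# each position in the lists becomes a row.
students = pd.DataFrame(
    {
        "name": ["Asha", "Ben", "Chen"],
        "mark": [82, 67, 91],
        "passed": [True, True, True],
    }
)

print(students.shape)   # (number of rows, number of columns)
print(students.dtypes)  # one dtype per column
```

Each column keeps its own dtype, which is what lets a single table mix text, numbers, and booleans.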



In [288]:
# Import pandas and NumPy

import pandas as pd
import numpy as np

In [289]:
# Build a 6x4 DataFrame with labeled rows and columns

df = pd.DataFrame(np.arange(0, 24).reshape(6, 4), index = ["Row1", "Row2", "Row3", "Row4", "Row5", "Row5"], columns = ["Column1", "Column2", "Column3", "Column4"])

In [290]:
df.head()

Unnamed: 0,Column1,Column2,Column3,Column4
Row1,0,1,2,3
Row2,4,5,6,7
Row3,8,9,10,11
Row4,12,13,14,15
Row5,16,17,18,19


#### **NOTE ==>** `head()` is a built-in method of `pandas.DataFrame`. It returns the first five rows of the DataFrame by default; pass a number, as in `df.head(3)`, to get a different count.
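As a small sketch of how the row count can be controlled, the example below builds a similar 6x4 frame (without custom row labels, which is an assumption made for brevity) and uses `head()` together with its counterpart `tail()`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(24).reshape(6, 4),
    columns=["Column1", "Column2", "Column3", "Column4"],
)

first_two = df.head(2)   # first 2 rows instead of the default 5
last_three = df.tail(3)  # tail() works the same way from the end

print(first_two)
print(last_three)
```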

In [291]:
df.to_csv("test.csv")

#### **to_csv()** Built-in DataFrame method that saves the DataFrame as a CSV file. The row index is written as the first column unless you pass `index=False`.
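A minimal round-trip sketch, assuming a small made-up frame and writing to a string instead of a file so that nothing is left on disk. It also shows where a column named `Unnamed: 0` comes from: the saved row index has no header, so reading the file back without `index_col=0` gives that column a placeholder name.

```python
from io import StringIO

import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(6).reshape(2, 3),
    index=["Row1", "Row2"],
    columns=["A", "B", "C"],
)

csv_text = df.to_csv()             # no path given, so the CSV is returned as a string
no_index = df.to_csv(index=False)  # omit the row labels entirely

# Read it back, telling pandas that the first column holds the row labels.
restored = pd.read_csv(StringIO(csv_text), index_col=0)

# Without index_col, the label column gets a placeholder name.
plain = pd.read_csv(StringIO(csv_text))
print(plain.columns[0])
```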

## Accessing DataFrame Elements: Cheat Sheet

**Label-based:**

* **Single element:** `df['column']['row']` (fine for reading; `df.loc['row', 'column']` is the preferred form, especially for assignment)
* **Column:** `df['column_name']`
* **Multiple columns:** `df[['col1', 'col2']]`

**Position-based (iloc):**

* **Row:** `df.iloc[row_index]`
* **Specific element:** `df.iloc[row_index, col_index]`
* **Subset:** `df.iloc[start_row:end_row, start_col:end_col]`

**Boolean Indexing:**

* **Conditionally select rows:** `df[df['column'] > value]`

**Tips:**

* Positional indices (`iloc`) start from 0, and positional slices exclude the end position.
* Use `df.head()` to understand structure.
* More advanced options in Pandas documentation.

**Bonus:**

* Use `df.loc` for label-based selection of rows and columns; unlike `iloc`, label slices include the end label.
* Use `df.at`/`df.iat` for faster scalar access.
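The cheat sheet above can be tried out on a small frame. This sketch assumes unique row labels `Row1` to `Row6`, which differs slightly from the notebook's frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(24).reshape(6, 4),
    index=["Row1", "Row2", "Row3", "Row4", "Row5", "Row6"],
    columns=["Column1", "Column2", "Column3", "Column4"],
)

by_label = df.loc["Row2", "Column3"]  # label-based scalar access
by_position = df.iloc[1, 2]           # same cell, by integer position
fast_label = df.at["Row2", "Column3"] # scalar-only, label-based
fast_position = df.iat[1, 2]          # scalar-only, position-based

# Label slices include the end label; positional slices exclude the end.
label_subset = df.loc["Row1":"Row3", ["Column1", "Column2"]]
position_subset = df.iloc[0:3, 0:2]

# Boolean indexing keeps the rows where the condition is True.
filtered = df[df["Column1"] > 8]
```

Here `label_subset` and `position_subset` select the same three rows, which shows the inclusive versus exclusive slice endpoints side by side.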

In [292]:
df = pd.DataFrame(np.arange(0, 24).reshape(6, 4), index = ["Row1", "Row2", "Row3", "Row4", "Row5", "Row5"], columns = ["Column1", "Column2", "Column3", "Column4"])

In [293]:
df.head()

Unnamed: 0,Column1,Column2,Column3,Column4
Row1,0,1,2,3
Row2,4,5,6,7
Row3,8,9,10,11
Row4,12,13,14,15
Row5,16,17,18,19


In [294]:
df["Column1"]["Row1"]

0

In [295]:
df["Column1"]

Row1     0
Row2     4
Row3     8
Row4    12
Row5    16
Row5    20
Name: Column1, dtype: int32

In [296]:
df[["Column1", "Column2"]]

Unnamed: 0,Column1,Column2
Row1,0,1
Row2,4,5
Row3,8,9
Row4,12,13
Row5,16,17
Row5,20,21


In [297]:
df.loc["Row1"]

Column1    0
Column2    1
Column3    2
Column4    3
Name: Row1, dtype: int32

In [298]:
type(df.loc["Row1"])

pandas.core.series.Series

## Data Series: Quick Guide

* **1. Pandas Series:**
    - One-dimensional labeled array: a sequence of values plus an index.
    - Selecting a single column (`df["Column1"]`) or a single row (`df.loc["Row1"]`) of a DataFrame returns a Series.
    - Accessed & manipulated much like a DataFrame, but along one dimension only.
* **2. Independent Data Sequence:**
    - Any ordered set of data points, not part of a DataFrame.
    - Temperatures over time, stock prices, etc.
    - Analyzed for trends, patterns, & relationships.

**Remember:**

- Consider context to determine meaning.
- In Pandas: a Series is most often one column of a DataFrame, but a single row is also returned as a Series.
- Other contexts: Data series = any independent data sequence.
- A Series can also be created on its own with `pd.Series(...)`.
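A short sketch of both cases: a Series pulled out of a DataFrame and a Series created on its own. The temperature values are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(8).reshape(2, 4),
    index=["Row1", "Row2"],
    columns=["Column1", "Column2", "Column3", "Column4"],
)

column = df["Column1"]  # one column: indexed by the row labels
row = df.loc["Row1"]    # one row: indexed by the column names

# A standalone Series with its own labels (hypothetical readings).
temps = pd.Series([21.5, 23.0, 19.8], index=["Mon", "Tue", "Wed"], name="temperature")

print(type(column).__name__, type(row).__name__)
print(temps["Tue"], temps.mean())
```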

In [299]:
df.iloc[0:2, 0:2]

Unnamed: 0,Column1,Column2
Row1,0,1
Row2,4,5


In [300]:
type(df.iloc[0:2, 0:2])

pandas.core.frame.DataFrame

In [301]:
type(df.iloc[0:1, 0])

pandas.core.series.Series

## DataFrame to Array: Short Guide

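* **`to_numpy()`:** Entire DataFrame to NumPy array (the recommended method).
* **`.values`:** Older attribute that returns the same kind of array; `to_numpy()` is preferred in new code.
* **Specific columns:** `df['col_name'].to_numpy()`.
* **Multiple columns:** `df.iloc[:, [0, 2]].to_numpy()` (precise selection).

**Tips:**

* Choose method based on your needs (entire/specific data).
* Neither `to_numpy()` nor `.values` guarantees a copy; the result may share memory with the DataFrame. Pass `copy=True` to `to_numpy()` when you need an independent array.
* DataFrames can have mixed types, while an array has a single dtype; mixed columns are converted to a common dtype, often `object`.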
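The points above can be checked with a small sketch. Both frames are made up for illustration.

```python
import pandas as pd

numeric = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
mixed = pd.DataFrame({"a": [1, 2], "label": ["x", "y"]})

arr = numeric.to_numpy()                # 2-D integer array
independent = numeric.to_numpy(copy=True)  # guaranteed not to share memory
independent[0, 0] = 99                  # does not touch the DataFrame

mixed_arr = mixed.to_numpy()            # no common numeric type, so dtype is object

print(arr.shape, arr.dtype)
print(mixed_arr.dtype)
```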

In [302]:
df.iloc[:, 1:].values

array([[ 1,  2,  3],
       [ 5,  6,  7],
       [ 9, 10, 11],
       [13, 14, 15],
       [17, 18, 19],
       [21, 22, 23]])

In [303]:
df.iloc[:, 1:].values.shape

(6, 3)

In [304]:
df.isnull()  # Flag missing (null) values cell by cell

Unnamed: 0,Column1,Column2,Column3,Column4
Row1,False,False,False,False
Row2,False,False,False,False
Row3,False,False,False,False
Row4,False,False,False,False
Row5,False,False,False,False
Row5,False,False,False,False


In [305]:
df.isnull().sum()  # Count the missing values in each column

Column1    0
Column2    0
Column3    0
Column4    0
dtype: int64

In [306]:
df["Column1"].value_counts()  # Count how many times each distinct value occurs

Column1
0     1
4     1
8     1
12    1
16    1
20    1
Name: count, dtype: int64

In [307]:
df["Column1"].unique()  # Return the distinct values of the column as an array

array([ 0,  4,  8, 12, 16, 20])
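The frame above has no missing or repeated values, so these methods return trivial results. A sketch with made-up data that does contain gaps and duplicates shows what they report in that case.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with some missing entries.
scores = pd.DataFrame(
    {
        "math": [90, np.nan, 75, 90],
        "art": [np.nan, np.nan, 60, 80],
    }
)

missing_per_column = scores.isnull().sum()  # number of NaN values per column
counts = scores["math"].value_counts()      # NaN is left out by default
distinct = scores["math"].unique()          # NaN is kept as a distinct value
n_distinct = scores["math"].nunique()       # NaN is not counted by default

print(missing_per_column)
print(counts)
```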

## Reading Files with Pandas

* **CSV:** `pd.read_csv("file.csv")` - Common tabular data format.
* **Excel:** `pd.read_excel("file.xlsx")` - Spreadsheet format.
* **JSON:** `pd.read_json("file.json")` - Data interchange format.
* **Text:** `pd.read_fwf()`, `pd.read_table()` - Simple text formats.
* **SQL:** `pd.read_sql("SELECT * FROM table", engine)` - Database interaction.
* **More:** Additional formats with specific libraries/functions.

**Tips:**

* Specify file path.
* Customize reading with optional parameters (header, sep, etc.).
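As a sketch of the optional parameters mentioned above, the example below parses semicolon-separated text that has no header row. The data is invented and read from a string so the example needs no file.

```python
from io import StringIO

import pandas as pd

# Hypothetical file contents: ';' as separator, no header line.
raw = "1;Asha;82\n2;Ben;67\n"

df = pd.read_csv(
    StringIO(raw),
    sep=";",                       # field separator
    header=None,                   # the first line is data, not column names
    names=["id", "name", "mark"],  # supply the column names ourselves
)

print(df)
```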

In [308]:
df = pd.read_csv("student.csv")  # Load the CSV file

In [309]:
df.head() #See the first five rows

Unnamed: 0,id,name,class,mark,gender
0,1,Vithu,Four,75,male
1,2,Nila,Three,85,female
2,3,Arnold,Three,55,male
3,4,Krish,Four,60,female
4,5,John,Four,60,female


In [310]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      35 non-null     int64 
 1   name    35 non-null     object
 2   class   35 non-null     object
 3   mark    35 non-null     int64 
 4   gender  35 non-null     object
dtypes: int64(2), object(3)
memory usage: 1.5+ KB


### df.info()

* **Provides quick overview of DataFrame structure and content.**
* **Shows:**
    * Rows & columns count
    * Column names & data types
    * Memory usage
    * Non-null value counts
* **Useful for:**
    * Exploring data structure
    * Checking for missing values
    * Verifying data types
    * Optimizing memory usage

**Think:** Data snapshot for quick understanding and analysis!
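`info()` prints its report rather than returning it. If the text is needed as a string, one option is to pass a buffer, as in this sketch with made-up data; `memory_usage="deep"` asks for a more accurate memory figure for text columns.

```python
from io import StringIO

import numpy as np
import pandas as pd

# Hypothetical frame with one missing mark.
df = pd.DataFrame(
    {
        "name": ["Asha", "Ben", "Chen"],
        "mark": [82.0, np.nan, 91.0],
    }
)

buffer = StringIO()
df.info(buf=buffer, memory_usage="deep")  # write the report into the buffer
summary = buffer.getvalue()

print(summary)
```

A non-null count lower than the number of entries, as for `mark` here, is the quickest hint that a column has missing values.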


In [311]:
df.describe()

Unnamed: 0,id,mark
count,35.0,35.0
mean,18.0,74.657143
std,10.246951,16.401117
min,1.0,18.0
25%,9.5,62.5
50%,18.0,79.0
75%,26.5,88.0
max,35.0,96.0


### `df.describe()`:

**Purpose:** 

`df.describe()` is a powerful tool in Pandas for generating **descriptive statistics** of your DataFrame's numeric columns. It provides a concise summary of the **central tendency, spread, and distribution** of your data, helping you gain insights into its characteristics.

**Output:**

The output of `df.describe()` depends on the data types within your DataFrame. For **numeric columns** it typically consists of:

* **Count:** The number of non-null values in the column.
* **Mean:** The average value of the non-null entries.
* **Standard deviation:** A measure of how spread out the data is around the mean.
* **Minimum and maximum:** The lowest and highest values found in the column.
* **Percentiles:** Values below which a given share of the data falls (e.g., the 25th percentile has about 25% of the values at or below it and 75% above). The 50th percentile is the median.

**Benefits:**

* **Quick data exploration:** Get a snapshot of the distribution of your numerical data without running complex calculations.
* **Outlier detection:** Identify potential outliers that deviate significantly from the main body of data.
* **Skewness assessment:** See if the data distribution is skewed towards one side (asymmetrical).
* **Central tendency and spread:** Understand the typical "middle" and range of your data points.

**Additional tips:**

* You can choose the reported percentiles with the `percentiles` argument, e.g. `df.describe(percentiles=[0.1, 0.9])`.
* Use `df.describe(include='all')` to include descriptive statistics for object columns (e.g., unique value counts).
* Remember, by default `df.describe()` summarizes only the numeric columns of a mixed DataFrame. For non-numeric columns, pass `include` or consider alternatives such as `value_counts()`.

**Think of `df.describe()` as your statistical cheat sheet for understanding the numeric heart of your DataFrame.** 
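The optional arguments mentioned in the tips can be sketched as follows, using a small made-up frame rather than the student file.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "mark": [55, 60, 75, 88, 96],
        "gender": ["male", "female", "female", "male", "female"],
    }
)

numeric_summary = df.describe()                # numeric columns only
custom = df.describe(percentiles=[0.1, 0.9])   # request the 10th and 90th percentiles
everything = df.describe(include="all")        # adds unique/top/freq for text columns

print(numeric_summary)
print(custom.index.tolist())
print(everything)
```

In the `include="all"` output, statistics that do not apply to a column (for example `mean` for `gender`) are shown as `NaN`.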

In [312]:
# Count how many students received each mark
df["mark"].value_counts()

mark
88    7
55    5
78    3
79    3
75    2
69    2
60    2
85    2
90    1
86    1
81    1
54    1
65    1
18    1
94    1
89    1
96    1
Name: count, dtype: int64

In [313]:
df[df["mark"] >= 75]

Unnamed: 0,id,name,class,mark,gender
0,1,Vithu,Four,75,male
1,2,Nila,Three,85,female
6,7,My John Rob,Fifth,78,male
7,8,Asruid,Five,85,male
8,9,Tes Qry,Six,78,male
10,11,Ronald,Six,89,female
11,12,Recky,Six,94,female
12,13,Kty,Seven,88,female
13,14,Bigy,Seven,88,female
14,15,Tade Row,Four,88,male
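Conditions like the one above can be combined. This sketch uses a few invented rows in the same shape as the student data; note that each condition needs its own parentheses and that `&` and `|` are used instead of `and` and `or`.

```python
import pandas as pd

# Hypothetical rows mirroring the student table's columns.
df = pd.DataFrame(
    {
        "name": ["Asha", "Ben", "Chen", "Dara"],
        "class": ["Four", "Three", "Four", "Six"],
        "mark": [75, 85, 55, 94],
    }
)

high_in_four = df[(df["mark"] >= 75) & (df["class"] == "Four")]  # both conditions
low_or_six = df[(df["mark"] < 60) | (df["class"] == "Six")]      # either condition
selected = df[df["class"].isin(["Three", "Six"])]                # membership test

print(high_in_four)
print(low_or_six)
print(selected)
```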


# CSV

In [314]:
from io import StringIO  # lets read_csv treat a string as if it were a file

In [315]:
data = ("col1, col2, col3\n"
       "x, y, 1\n"
       "a, b, 2\n"
       "c, d, 3")

In [316]:
type(data)

str

In [317]:
pd.read_csv(StringIO(data))

Unnamed: 0,col1,col2,col3
0,x,y,1
1,a,b,2
2,c,d,3


In [318]:
# Read only specific columns
df = pd.read_csv(StringIO(data), usecols = ["col1"])

In [319]:
df

Unnamed: 0,col1
0,x
1,a
2,c


In [320]:
data = ("a, b, c, d\n"
       "1, 2, 3, 4\n"
       "5, 6, 7, 8\n"
       "9, 10, 11, 12")

In [321]:
print(data)

a, b, c, d
1, 2, 3, 4
5, 6, 7, 8
9, 10, 11, 12


In [322]:
# Read CSV data from a string, treating all columns as string objects (dtype=object)
df = pd.read_csv(StringIO(data), dtype = object)

In [323]:
df

Unnamed: 0,a,b,c,d
0,1,2,3,4
1,5,6,7,8
2,9,10,11,12


In [324]:
df["a"][0]

'1'

In [325]:
type(df["a"][0])

str

In [326]:
df["a"]

0    1
1    5
2    9
Name: a, dtype: object

In [327]:
df = pd.read_csv(StringIO(data), dtype = int)

In [328]:
type(df["a"][0])

numpy.int32

In [329]:
df = pd.read_csv(StringIO(data), dtype = float)

In [330]:
type(df["a"][0])

numpy.float64

In [331]:
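# Give a different dtype to each column.
# Caution: the header has a space after each comma, so the parsed column names are
# " b" and " c". The "b" and "c" keys below match no column and are ignored, which is
# why df.dtypes reports int64 for " c". skipinitialspace=True would strip those spaces.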
df = pd.read_csv(StringIO(data), dtype={"a": "int64", "b": int, "c": float})

In [332]:
df

Unnamed: 0,a,b,c,d
0,1,2,3,4
1,5,6,7,8
2,9,10,11,12


In [333]:
type(df["a"][1])

numpy.int64

In [334]:
df.dtypes

a     int64
 b    int64
 c    int64
 d    int64
dtype: object
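The leading spaces in ` b`, ` c`, and ` d` above come from the space after each comma in the header, and they are why the `float` requested for `c` did not take effect. A sketch of one way to handle this, using `skipinitialspace=True` on the same string:

```python
from io import StringIO

import pandas as pd

data = ("a, b, c, d\n"
        "1, 2, 3, 4\n"
        "5, 6, 7, 8\n"
        "9, 10, 11, 12")

# Default parsing keeps the space as part of the column name.
raw = pd.read_csv(StringIO(data))
print(list(raw.columns))

# Skipping the space after each delimiter gives clean names,
# so the dtype mapping now matches the columns it names.
clean = pd.read_csv(
    StringIO(data),
    skipinitialspace=True,
    dtype={"a": "int64", "b": "int64", "c": "float64"},
)
print(clean.dtypes)
```

An alternative with the same effect on the names is to clean them after loading with `df.columns = df.columns.str.strip()`, although the dtype mapping would then still have to use the original names.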

In [335]:
data = ("index,a, b, c\n"
       "4, apple, bat, 5.7\n"
       "8, orange, cow, 10\n")

In [336]:
pd.read_csv(StringIO(data))

Unnamed: 0,index,a,b,c
0,4,apple,bat,5.7
1,8,orange,cow,10.0


In [337]:
pd.read_csv(StringIO(data), index_col = 0)

index,a,b,c
4,apple,bat,5.7
8,orange,cow,10.0


In [338]:
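# Each data row has one more field than the header,
# so pandas uses the first column as the row index.
data = ("a, b, c\n"
       "4, apple, bat, 5.7\n"
       "8, orange, cow, 10\n")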

In [339]:
pd.read_csv(StringIO(data))

Unnamed: 0,a,b,c
4,apple,bat,5.7
8,orange,cow,10.0
