<a href="https://colab.research.google.com/github/themysterysolver/PYTHON_BASICS/blob/main/PANDAS/GFG_FULL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PANDAS

- Pandas is built on top of the NumPy library, so many NumPy structures are used or replicated in Pandas.
- The data produced by Pandas is often used as input for plotting functions in Matplotlib, statistical analysis in SciPy, and machine learning algorithms in Scikit-learn.
- Things we can do using Pandas:
  - Data set cleaning, merging, and joining.
  - Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data.
  - Columns can be inserted and deleted from DataFrame and higher-dimensional objects.
  - Powerful group by functionality for performing split-apply-combine operations on data sets.
  - Data Visualization.

- Data structures in the Pandas library
Pandas provides two main data structures for manipulating data:
  - Series
  - DataFrame

- Pandas data structures are usually created by loading datasets from existing storage (which can be a SQL database, a CSV file, or an Excel file).
- They can also be created from ***lists, dictionaries, a list of dictionaries***, etc.
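As a minimal sketch of the in-memory options listed above, the following builds a Series and a DataFrame from a list, a dictionary, and a list of dictionaries. The variable names and values are illustrative only and are not part of the original notebook.

```python
import pandas as pd

# Series from a plain Python list (default integer index 0..n-1)
s_list = pd.Series([10, 20, 30])

# Series from a dictionary (keys become the index labels)
s_dict = pd.Series({'x': 1, 'y': 2})

# DataFrame from a dictionary of equal-length lists (keys become column names)
df_dict = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# DataFrame from a list of dictionaries (each dictionary becomes one row;
# keys missing from a row are filled with NaN)
df_rows = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3}])

print(s_list, s_dict, df_dict, df_rows, sep='\n\n')
```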

In [1]:
import pandas as pd
import numpy as np

In [None]:
ser=pd.Series(np.array([chr(i+ord('a')) for i in range(1,5)]))
print(ser)
df=pd.DataFrame({'a':[1,2,3],'b':[4,5,6]})#all column lists must have the same length
print(df)

0    b
1    c
2    d
3    e
dtype: object
   a  b
0  1  4
1  2  5
2  3  6


- The axis labels are collectively called the `index`. A Pandas Series is like a single column in an Excel sheet.
- *Accessing elements of a Series*
There are two ways to access an element of a Series:
  - Accessing an element by position
  - Accessing an element by label (index)
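The two access styles are easiest to tell apart when the labels are not integers. The sketch below assumes a small Series with hypothetical string labels: `.iloc` selects by position, while plain square brackets with a label select by label.

```python
import pandas as pd

# String labels make it obvious which kind of access is happening
s = pd.Series([10, 20, 30, 40], index=['w', 'x', 'y', 'z'])

first_by_position = s.iloc[0]   # position 0 -> 10
by_label = s['y']               # label 'y'  -> 30
first_three = s.iloc[:3]        # positions 0, 1, 2

print(first_by_position, by_label)
print(first_three)
```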

In [None]:
print(ser)

0    b
1    c
2    d
3    e
dtype: object


In [None]:
ser[:3]

0    b
1    c
2    d
dtype: object


In [None]:
ser[3]#accessing via label (the default index labels are 0..3)

'e'

In [None]:
ser.index=([12,13,14,15])#changing label
print(ser)

12    b
13    c
14    d
15    e
dtype: object


In [None]:
print(ser[12])#accessing via label

b
12    b
13    c
dtype: object


### Indexing
- Indexing in pandas means selecting particular data from a Series or DataFrame.
- Indexing can select all of the data or only some of it, for example particular rows or columns. It is also known as subset selection.
---
- The indexing operator is the pair of square brackets following an object, as in `df[ ]`.
- The `.loc` and `.iloc` indexers also use the indexing operator to make selections.

In [None]:
print(ser.loc[12:13]) #loc is used for label-based indexing. It allows you to access rows and columns by their labels or boolean arrays.
print(ser.iloc[1:3]) #iloc is used for integer-location-based indexing. It selects data by position (integer indices).

12    b
13    c
dtype: object
13    c
14    d
dtype: object
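One detail the output above hints at is that the two indexers treat the end of a slice differently: `.loc` includes the end label, while `.iloc` excludes the end position, like ordinary Python slicing. A minimal sketch, rebuilding a Series with the same labels as `ser`:

```python
import pandas as pd

s = pd.Series(['b', 'c', 'd', 'e'], index=[12, 13, 14, 15])

by_label = s.loc[12:14]      # labels 12, 13 and 14 (end label included)
by_position = s.iloc[0:2]    # positions 0 and 1 (end position excluded)

print(by_label)
print(by_position)
```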


## Binary Operation on Series
- `<series>.add(<series>)`
- `<series>.sub(<series>)`
- The operators `+`, `-`, `*`, `/`, `**` can also be used!
---
- Addition: add (+)
- Subtraction: sub (-)
- Multiplication: mul (*)
- Division: div or truediv (/)
- Floor Division: floordiv (//)
- Modulus: mod (%)
- Power: pow (**)
- Equality: eq (==)
- Not Equal: ne (!=)
- Greater Than: gt (>)
- Less Than: lt (<)
- Greater Than or Equal: ge (>=)
- Less Than or Equal: le (<=)
- Logical AND: & (bitwise AND)
- Logical OR: | (bitwise OR)
- Logical XOR: ^ (bitwise XOR)
- Logical NOT: ~ (bitwise NOT)
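As a short illustration of the table above, the sketch below checks that a method and its operator give the same result, and shows that comparisons return boolean Series that can be combined with `&`. The Series `a` and `b` are made-up values.

```python
import pandas as pd

a = pd.Series([1, 2, 3, 4])
b = pd.Series([4, 3, 2, 1])

# Method form and operator form give the same result
same = a.add(b).equals(a + b)

# Comparisons return a boolean Series
greater = a.gt(b)                 # same as a > b

# Boolean Series are combined with &, |, ^ and ~
# (wrap each comparison in parentheses)
both = (a > 1) & (b > 1)

print(same)
print(greater.tolist())
print(both.tolist())
```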


The `fill_value` parameter of the binary operation methods (`add`, `sub`, `mul`, etc.) handles missing data (`NaN`, "Not a Number") by substituting the given value for a missing operand before the operation is performed. This is particularly useful when operating on two Series or DataFrames whose indices (labels) do not match. If a label is missing from both operands, the result is still NaN.

In [None]:
data1=pd.Series([1,2,3,4],index=['a','b','c','f'])
data2=pd.Series([1,2,3,4],index=['a','b','d','e'])
print(data1,'\n',data2)

a    1
b    2
c    3
f    4
dtype: int64 
 a    1
b    2
d    3
e    4
dtype: int64


In [None]:
print(data1+data2,type(data1+data2))
print(data1.add(data2,fill_value=100))
result=data1+data2
result.fill_value=100 #this only sets a plain attribute on the object; the NaN values are not filled
print(result)

a    2.0
b    4.0
c    NaN
d    NaN
e    NaN
f    NaN
dtype: float64 <class 'pandas.core.series.Series'>
a      2.0
b      4.0
c    103.0
d    103.0
e    104.0
f    104.0
dtype: float64
a    2.0
b    4.0
c    NaN
d    NaN
e    NaN
f    NaN
dtype: float64
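The last output still contains NaN because assigning to `result.fill_value` does not change any values. To replace NaN after an operation, `fillna` is the usual tool, and it gives a different answer from `fill_value`, which substitutes the missing operand before the operation. A small sketch using the same two Series:

```python
import pandas as pd

data1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'f'])
data2 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'd', 'e'])

# fill_value is applied BEFORE the addition: a missing operand is treated as 100
filled_before = data1.add(data2, fill_value=100)

# fillna is applied AFTER the addition: the NaN results are replaced by 100
filled_after = (data1 + data2).fillna(100)

print(filled_before['c'])  # 103.0
print(filled_after['c'])   # 100.0
```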


## Conversion Operation on Series


- Conversion operations change the form of a Series, for example *changing its datatype or converting it to a list*. Pandas provides methods such as `.astype()` and `.tolist()` for this.
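A minimal sketch of both conversions, assuming a hypothetical Series of marks stored as strings (not taken from `student.csv`):

```python
import pandas as pd

marks = pd.Series(['85', '78', '92'])    # hypothetical marks stored as strings

as_int = marks.astype(int)               # change the dtype of the Series
as_list = as_int.tolist()                # convert the Series to a plain Python list

print(as_int.dtype)
print(as_list, type(as_list))
```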

In [6]:
df=pd.read_csv("student.csv")
print(df)

   StudentID      Name  Mathematics  Science   History  English  Geography
0        101     Alice           85       89        90       88         92
1        102       Bob           78       74        85      NaN         81
2        103   Charlie           92       95        87       90        NaN
3        104     Diana           88       91        93       85         89
4        105     Ethan           76       80        82       78         85
5        106     Fiona          NaN       88        84       82         86
6        107    George           81      NaN        89       77         83


- `dropna()` returns a new DataFrame with missing values (NaN) removed. By default, it drops every row that contains at least one NaN value.
   - It does not modify the original DataFrame unless you specify `inplace=True`.
- `dropna(inplace=True)` modifies the original DataFrame in place, removing rows with missing values (NaN).
  - No new DataFrame is created (the call returns `None`).
  - The changes are applied directly to the original DataFrame.
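The difference between the two forms can be sketched on a small in-memory table. The data here is hypothetical and uses real `NaN` values, so it does not depend on how `student.csv` is parsed.

```python
import pandas as pd
import numpy as np

# Small hypothetical table, not the student.csv data
table = pd.DataFrame({'Name': ['Alice', 'Bob', 'Diana'],
                      'English': [88, np.nan, 85]})

cleaned = table.dropna()            # returns a new DataFrame; table is unchanged
print(len(table), len(cleaned))     # 3 2

returned = table.dropna(inplace=True)   # modifies table itself and returns None
print(len(table), returned)             # 2 None
```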

In [16]:
duplicate=df.copy() #.copy() creates a deep copy of a DataFrame or Series by default
d=duplicate.dropna()
print(duplicate)
print(d)
duplicate.dropna(inplace=True)
print(duplicate)

   StudentID      Name  Mathematics  Science   History  English  Geography
0        101     Alice           85       89        90       88         92
1        102       Bob           78       74        85      NaN         81
2        103   Charlie           92       95        87       90        NaN
3        104     Diana           88       91        93       85         89
4        105     Ethan           76       80        82       78         85
5        106     Fiona          NaN       88        84       82         86
6        107    George           81      NaN        89       77         83
   StudentID      Name  Mathematics  Science   History  English  Geography
0        101     Alice           85       89        90       88         92
1        102       Bob           78       74        85      NaN         81
2        103   Charlie           92       95        87       90        NaN
3        104     Diana           88       91        93       85         89
4        105     Ethan   


The `strip()` method in Python is used to remove any leading and trailing whitespace characters (spaces, tabs, newlines, etc.) from a string. It does not modify the original string but returns a new string with the whitespace removed.
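As a quick sketch, the snippet below applies `strip()` to a single string and then uses the `.str.strip()` accessor to apply the same operation to every label of an Index. The example labels are chosen to resemble the padded column names in this notebook.

```python
import pandas as pd

padded = '  History \n'
print(repr(padded.strip()))      # 'History'
print(repr(padded))              # the original string is unchanged

# .str.strip() applies the same idea to every label in an Index
cols = pd.Index(['StudentID', ' Name', ' History'])
stripped = cols.str.strip()
print(stripped.tolist())         # ['StudentID', 'Name', 'History']
```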

In [18]:
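print(d.columns)
d.columns=d.columns.str.strip() #remove leading/trailing spaces from the column names
print(type(d['History'])) #a single column is a Series
d['History']=d['History'].tolist() #tolist() returns a Python list, but assigning it back to a column stores it as a Series again
print(type(d['History']))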

Index(['StudentID', ' Name', ' Mathematics', ' Science', ' History',
       ' English', ' Geography'],
      dtype='object')
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
