
# PANDAS

- It is built on top of the NumPy library which means that a lot of the structures of NumPy are used or replicated in Pandas.
- The data produced by Pandas is often used as input for plotting functions in Matplotlib, statistical analysis in SciPy, and machine learning algorithms in Scikit-learn.
- Things we can do using Pandas:
  - Data set cleaning, merging, and joining.
  - Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data.
  - Columns can be inserted and deleted from DataFrame and higher-dimensional objects.
  - Powerful group by functionality for performing split-apply-combine operations on data sets.
  - Data Visualization.

- Data Structures in the Pandas Library
Pandas generally provides two data structures for manipulating data. They are:
  - Series
  - DataFrame

- Pandas data structures can be created by loading datasets from existing storage (such as a SQL database, a CSV file, or an Excel file).
- They can also be created from ***lists, dictionaries, a list of dictionaries***, etc.
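
A minimal sketch of creating both structures from plain Python containers. The variable names and values are made up for illustration.

```python
import pandas as pd

# A Series from a plain list; pandas assigns the default integer index 0..n-1.
s_from_list = pd.Series([10, 20, 30])

# A Series from a dictionary; the keys become the index labels.
s_from_dict = pd.Series({"x": 1, "y": 2})

# A DataFrame from a dictionary of equal-length lists; the keys become column names.
df_from_dict = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# A DataFrame from a list of dictionaries; each dictionary becomes one row.
df_from_records = pd.DataFrame([{"a": 1, "b": 4}, {"a": 2, "b": 5}])

print(s_from_list, s_from_dict, df_from_dict, df_from_records, sep="\n\n")
```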

In [2]:
import pandas as pd
import numpy as np

In [None]:
ser=pd.Series(np.array([chr(i+ord('a')) for i in range(1,5)]))
print(ser)
df=pd.DataFrame({'a':[1,2,3],'b':[4,5,6]})#every column list must have the same length
print(df)

0    b
1    c
2    d
3    e
dtype: object
   a  b
0  1  4
1  2  5
2  3  6


- The axis labels are collectively called the `index`. A Pandas Series is comparable to a single column in an Excel sheet.
- *Accessing elements of a Series*
There are two ways to access an element of a Series:
  - Accessing Element from Series with Position
  - Accessing Element Using Label (index)

In [None]:
print(ser)

0    b
1    c
2    d
3    e
dtype: object


In [None]:
ser[:3]

0    b
1    c
2    d
dtype: object


In [None]:
ser[3]#accessing via label (the default labels are 0..3)

'e'

In [None]:
ser.index=([12,13,14,15])#changing label
print(ser)

12    b
13    c
14    d
15    e
dtype: object


In [None]:
print(ser[12])#accessing via label

b
12    b
13    c
dtype: object


### Indexing
- Indexing in pandas simply means selecting particular data from a Series or DataFrame.
- Indexing can select all of the data or only some of it, such as particular rows or columns. It is also known as subset selection.
---
- The indexing operator is the pair of square brackets that follows an object, as in `df[ ]`.
- The `.loc` and `.iloc` indexers also use the indexing operator to make selections.

In [None]:
print(ser.loc[12:13]) #loc is used for label-based indexing. It allows you to access rows and columns by their labels or boolean arrays.
print(ser.iloc[1:3]) #iloc is used for integer-location-based indexing. It selects data by position (integer indices).

12    b
13    c
dtype: object
13    c
14    d
dtype: object
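
The difference between the two indexers is easiest to see when the labels are integers that differ from the positions. The sketch below uses a hypothetical `letters` Series. Note that a label slice with `.loc` includes its end label, while a positional slice with `.iloc` excludes its end position.

```python
import pandas as pd

# Hypothetical Series whose integer labels differ from the positions 0..3.
letters = pd.Series(["b", "c", "d", "e"], index=[12, 13, 14, 15])

by_label = letters.loc[13]          # label 13
by_position = letters.iloc[1]       # position 1 (the second element)

label_slice = letters.loc[12:13]    # label slices include the end label
position_slice = letters.iloc[0:1]  # positional slices exclude the end position

print(by_label, by_position)
print(label_slice.tolist(), position_slice.tolist())
```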


## Binary Operation on Series
- `<series>.add(<series>)`
- `<series>.sub(<series>)`
- The operators `+`, `-`, `*`, `/`, `**` can also be used!
---
- Addition: add (+)
- Subtraction: sub (-)
- Multiplication: mul (*)
- Division: div or truediv (/)
- Floor Division: floordiv (//)
- Modulus: mod (%)
- Power: pow (**)
- Equality: eq (==)
- Not Equal: ne (!=)
- Greater Than: gt (>)
- Less Than: lt (<)
- Greater Than or Equal: ge (>=)
- Less Than or Equal: le (<=)
- Logical AND: & (bitwise AND)
- Logical OR: | (bitwise OR)
- Logical XOR: ^ (bitwise XOR)
- Logical NOT: ~ (bitwise NOT)
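
As a quick check of the list above, the following sketch (with made-up Series `x` and `y`) compares the method form with the operator form and combines two comparisons with `&`.

```python
import pandas as pd

x = pd.Series([1, 2, 3])
y = pd.Series([3, 2, 1])

# The method form and the operator form give the same element-wise result.
same_sum = x.add(y).equals(x + y)
same_floordiv = x.floordiv(y).equals(x // y)

# Comparison methods return a Boolean Series.
greater = x.gt(y)  # same as x > y

# Boolean Series are combined with &, |, ^ and ~; wrap each comparison in parentheses.
both = (x >= 2) & (y >= 2)

print(same_sum, same_floordiv)
print(greater.tolist())
print(both.tolist())
```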


The `fill_value` parameter is used with binary operation methods such as `add()` to handle missing data (`NaN`, Not a Number). When a value is missing on only one side of the operation, `fill_value` is substituted for it before the operation is performed; if the value is missing on both sides, the result is still `NaN`. This is particularly useful when operating on two Series or DataFrames with mismatched indices (labels).

In [None]:
data1=pd.Series([1,2,3,4],index=['a','b','c','f'])
data2=pd.Series([1,2,3,4],index=['a','b','d','e'])
print(data1,'\n',data2)

a    1
b    2
c    3
f    4
dtype: int64 
 a    1
b    2
d    3
e    4
dtype: int64


In [None]:
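print(data1+data2,type(data1+data2))#labels present in only one Series give NaN
print(data1.add(data2,fill_value=100))#the missing side is treated as 100
result=data1+data2
result.fill_value=100#this only sets a plain attribute; it does not fill the NaN values
print(result)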

a    2.0
b    4.0
c    NaN
d    NaN
e    NaN
f    NaN
dtype: float64 <class 'pandas.core.series.Series'>
a      2.0
b      4.0
c    103.0
d    103.0
e    104.0
f    104.0
dtype: float64
a    2.0
b    4.0
c    NaN
d    NaN
e    NaN
f    NaN
dtype: float64
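
One detail worth checking is what happens when a value is missing on both sides. The sketch below uses hypothetical `left` and `right` Series; `fill_value` fills a gap on one side only.

```python
import numpy as np
import pandas as pd

left = pd.Series([1.0, np.nan, 3.0], index=["a", "b", "c"])
right = pd.Series([10.0, np.nan, 30.0], index=["a", "b", "d"])

# 'a' exists on both sides, 'b' is NaN on both sides,
# 'c' exists only in left, 'd' exists only in right.
total = left.add(right, fill_value=0)
print(total)
```

Here `c` and `d` are computed with 0 standing in for the absent side, while `b` stays `NaN` because neither Series has a value for it.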


## Conversion Operation on Series


- Conversion operations include *changing the datatype of a Series, converting a Series to a list*, etc. Pandas provides functions such as `.astype()` and `.tolist()` for this.

In [36]:
df=pd.DataFrame({
    'StudentID': [101, 102, 103, 104, 105, 106, 107],
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Ethan', 'Fiona', 'George'],
    'Mathematics': [85, 78, 92, 88, 76, np.nan, 81],
    'Science': [89, 74, 95, 91, 80, 88, np.nan],
    'History': [90, 85, 87, 93, 82, 84, 89],
    'English': [88, np.nan, 90, 85, 78, 82, 77],
    'Geography': [92, 81, np.nan, 89, 85, 86, 83]
})
print(df)


   StudentID     Name  Mathematics  Science  History  English  Geography
0        101    Alice         85.0     89.0       90     88.0       92.0
1        102      Bob         78.0     74.0       85      NaN       81.0
2        103  Charlie         92.0     95.0       87     90.0        NaN
3        104    Diana         88.0     91.0       93     85.0       89.0
4        105    Ethan         76.0     80.0       82     78.0       85.0
5        106    Fiona          NaN     88.0       84     82.0       86.0
6        107   George         81.0      NaN       89     77.0       83.0


- `dropna()` returns a new DataFrame with missing values (NaN) removed. By default, it drops every row that contains at least one NaN value.
   - It does not modify the original DataFrame unless you specify `inplace=True`.
- `dropna(inplace=True)` modifies the original DataFrame in place, removing rows with missing values (NaN).
  - No new DataFrame is created; the call returns `None`.
  - The changes are applied directly to the original DataFrame.

In [15]:
duplicate=df.copy() #The .copy() method in pandas creates a deep copy of a DataFrame or Series.
d=duplicate.dropna()
print(duplicate)
print(d)
duplicate.dropna(inplace=True)
print(duplicate)

   StudentID     Name  Mathematics  Science  History  English  Geography
0        101    Alice         85.0     89.0       90     88.0       92.0
1        102      Bob         78.0     74.0       85      NaN       81.0
2        103  Charlie         92.0     95.0       87     90.0        NaN
3        104    Diana         88.0     91.0       93     85.0       89.0
4        105    Ethan         76.0     80.0       82     78.0       85.0
5        106    Fiona          NaN     88.0       84     82.0       86.0
6        107   George         81.0      NaN       89     77.0       83.0
   StudentID   Name  Mathematics  Science  History  English  Geography
0        101  Alice         85.0     89.0       90     88.0       92.0
3        104  Diana         88.0     91.0       93     85.0       89.0
4        105  Ethan         76.0     80.0       82     78.0       85.0
   StudentID   Name  Mathematics  Science  History  English  Geography
0        101  Alice         85.0     89.0       90     88.0       92.0
3        104  Diana         88.0     91.0       93     85.0       89.0
4        105  Ethan         76.0     80.0       82     78.0       85.0
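
`dropna()` can also drop columns or look at only some columns. The sketch below uses a small hypothetical `scores` frame rather than the notebook's `df`; `axis` and `subset` are standard `dropna` parameters.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Mathematics": [85, np.nan, 92],
    "Science": [89, 74, np.nan],
})

rows_complete = scores.dropna()                       # drop rows with any NaN
cols_complete = scores.dropna(axis=1)                 # drop columns with any NaN
math_present = scores.dropna(subset=["Mathematics"])  # only check one column

print(rows_complete, cols_complete, math_present, sep="\n\n")
# The original frame still has all 3 rows because inplace=True was not used.
print(len(scores))
```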

- The `isna()` method in pandas detects missing values in a DataFrame or Series. It returns a DataFrame or Series of the same shape, where each element is a Boolean value: True if the value is missing (such as NaN, Not a Number, or `None`), and False otherwise.

In [23]:
print(df.isna(),'\n',df.isna().sum(),'\n',df.isna().sum().sum())

   StudentID   Name  Mathematics  Science  History  English  Geography
0      False  False        False    False    False    False      False
1      False  False        False    False    False     True      False
2      False  False        False    False    False    False       True
3      False  False        False    False    False    False      False
4      False  False        False    False    False    False      False
5      False  False         True    False    False    False      False
6      False  False        False     True    False    False      False 
 StudentID      0
Name           0
Mathematics    1
Science        1
History        0
English        1
Geography      1
dtype: int64 
 4
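
Because the result of `isna()` is a Boolean mask, it can also be used to select the incomplete rows. A small sketch with a hypothetical `scores` frame:

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Mathematics": [85, np.nan, 92],
    "Science": [89, 74, np.nan],
})

mask = scores.isna()                        # Boolean frame with the same shape as scores
missing_per_column = mask.sum()             # True counts as 1, so this counts NaN per column
incomplete_rows = scores[mask.any(axis=1)]  # rows with at least one missing value

print(missing_per_column)
print(incomplete_rows)
```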



Python's `strip()` method removes leading and trailing whitespace characters (spaces, tabs, newlines, etc.) from a string. It does not modify the original string but returns a new one. In pandas, `.str.strip()` applies it to every element, here to each column label.

In [27]:
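print(d.columns)
d.columns=d.columns.str.strip()#strip whitespace from every column label
print(type(d['History']))
d['History']=d['History'].tolist()#tolist() returns a Python list, but assigning it back stores it as a Series again
print(type(d['History']))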

Index(['StudentID', 'Name', 'Mathematics', 'Science', 'History', 'English',
       'Geography'],
      dtype='object')
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
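
The column labels above contain no stray whitespace, so the effect of `.str.strip()` is not visible there. The sketch below uses a hypothetical `messy` frame whose labels do need cleaning, and keeps the result of `tolist()` in its own variable so that it stays a list.

```python
import pandas as pd

# Hypothetical frame whose column labels carry stray spaces and a tab.
messy = pd.DataFrame({" History ": [90, 93], "English\t": [88, 85]})

messy.columns = messy.columns.str.strip()  # clean every label at once
history_list = messy["History"].tolist()   # a plain Python list, not a Series

print(list(messy.columns))
print(history_list, type(history_list))
```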


- `count()`	Returns the number of ***non***-NA/null observations in the Series
- `size`	Returns the number of elements in the underlying data, including NA/null values (it is an attribute, not a method)

In [26]:
print(d['Science'].count())

3
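
The two only differ when missing values are present. A short sketch with a hypothetical `science` Series:

```python
import numpy as np
import pandas as pd

science = pd.Series([89, 74, np.nan, 91])

non_missing = science.count()  # ignores the NaN entry
total = science.size           # counts every element, NaN included

print(non_missing, total, len(science))
```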


## axis
- `axis=0` (default): For reductions such as `sum()` or `count()`, the operation runs down the rows and produces one result per column, i.e., ***column-wise***.
- `axis=1`: The operation runs across the columns and produces one result per row, i.e., ***row-wise***.
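
A small sketch of both values on a hypothetical `marks` frame:

```python
import pandas as pd

marks = pd.DataFrame(
    {"Mathematics": [85, 78], "Science": [89, 74]},
    index=["Alice", "Bob"],
)

per_column = marks.sum(axis=0)  # one total per subject
per_row = marks.sum(axis=1)     # one total per student

print(per_column)
print(per_row)
```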

In [38]:
df = df.fillna(0)
print(df,type(df))
print(df.count())

   StudentID     Name  Mathematics  Science  History  English  Geography
0        101    Alice         85.0     89.0       90     88.0       92.0
1        102      Bob         78.0     74.0       85      0.0       81.0
2        103  Charlie         92.0     95.0       87     90.0        0.0
3        104    Diana         88.0     91.0       93     85.0       89.0
4        105    Ethan         76.0     80.0       82     78.0       85.0
5        106    Fiona          0.0     88.0       84     82.0       86.0
6        107   George         81.0      0.0       89     77.0       83.0 <class 'pandas.core.frame.DataFrame'>
StudentID      7
Name           7
Mathematics    7
Science        7
History        7
English        7
Geography      7
dtype: int64


In [39]:
df.head()

   StudentID     Name  Mathematics  Science  History  English  Geography
0        101    Alice         85.0     89.0       90     88.0       92.0
1        102      Bob         78.0     74.0       85      0.0       81.0
2        103  Charlie         92.0     95.0       87     90.0        0.0
3        104    Diana         88.0     91.0       93     85.0       89.0
4        105    Ethan         76.0     80.0       82     78.0       85.0


In [40]:
df.tail()

   StudentID     Name  Mathematics  Science  History  English  Geography
2        103  Charlie         92.0     95.0       87     90.0        0.0
3        104    Diana         88.0     91.0       93     85.0       89.0
4        105    Ethan         76.0     80.0       82     78.0       85.0
5        106    Fiona          0.0     88.0       84     82.0       86.0
6        107   George         81.0      0.0       89     77.0       83.0


 > `astype` returns a new Series or DataFrame with the elements converted to the specified type. It has no `inplace` parameter, so assign the result back to keep the conversion.


In [42]:
seri=pd.Series([1,2,3,4,5,6])
print(seri,type(seri))
seri=seri.astype(str)
print(seri,type(seri))

0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64 <class 'pandas.core.series.Series'>
0    1
1    2
2    3
3    4
4    5
5    6
dtype: object <class 'pandas.core.series.Series'>
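
`astype` also accepts a dictionary that maps column names to types, which converts several DataFrame columns in one call. The `raw` frame below is hypothetical.

```python
import pandas as pd

raw = pd.DataFrame({"id": ["1", "2", "3"], "score": [85.0, 78.0, 92.0]})

# Convert the string ids and the float scores to integers in one call.
typed = raw.astype({"id": "int64", "score": "int64"})

print(raw.dtypes)
print(typed.dtypes)
```

The original `raw` frame keeps its dtypes because `astype` returns a new object.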


In [60]:
df['History'].dtype
print(df.iloc[:,1])

0      Alice
1        Bob
2    Charlie
3      Diana
4      Ethan
5      Fiona
6     George
Name: Name, dtype: object


### 1. **`inplace=True`**
The `inplace=True` parameter is used in pandas methods to modify the original DataFrame or Series directly without needing to assign the result back to the variable.

- Common methods that support `inplace=True`:
  - `drop()`: Drop rows or columns.
  - `fillna()`: Replace `NaN` values with specified values.
  - `rename()`: Rename columns or index labels.
  - `sort_values()`: Sort values by a particular column.
  - `set_index()`: Set a column as the index.
  - `reset_index()`: Reset the index.
  - `replace()`: Replace values.
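
A minimal sketch of the difference, using `sort_values()` on a hypothetical `values` frame. With `inplace=True` the method returns `None`, so the result should not be assigned.

```python
import pandas as pd

values = pd.DataFrame({"x": [3, 1, 2]})

sorted_copy = values.sort_values("x")             # returns a new, sorted DataFrame
order_after_copy = values["x"].tolist()           # the original order is untouched

returned = values.sort_values("x", inplace=True)  # sorts values itself
order_after_inplace = values["x"].tolist()

print(order_after_copy, order_after_inplace, returned)
```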

---

### 2. **`axis`**
The `axis` parameter specifies which axis an operation works along: `axis=0` refers to the row index and `axis=1` refers to the columns.

- Common methods that use `axis`:
  - `drop()`: Drops rows (`axis=0`) or columns (`axis=1`).
  - `apply()`: Applies a function to each column (`axis=0`) or to each row (`axis=1`).
  - `sum()`, `mean()`, `std()`, etc.: Compute one statistic per column (`axis=0`) or one per row (`axis=1`).
  - `dropna()`: Drops rows (`axis=0`) or columns (`axis=1`) that contain missing values.
  - `pd.concat()`: Joins objects vertically (`axis=0`) or side by side (`axis=1`).
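
The sketch below tries `drop()` and `apply()` with both axis values on a hypothetical `marks` frame.

```python
import pandas as pd

marks = pd.DataFrame(
    {"Mathematics": [85, 78], "Science": [89, 74]},
    index=["Alice", "Bob"],
)

without_bob = marks.drop("Bob", axis=0)          # remove a row by its label
without_science = marks.drop("Science", axis=1)  # remove a column by its label

# axis=0 passes each column to the function, axis=1 passes each row.
column_range = marks.apply(lambda col: col.max() - col.min(), axis=0)
row_range = marks.apply(lambda row: row.max() - row.min(), axis=1)

print(without_bob, without_science, column_range, row_range, sep="\n\n")
```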


In [69]:
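def grade(num):
  #Map a numeric score to a letter grade. Note: as written, scores >= 90 and scores in [70, 80) both map to 'A'.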
  if num>=90:
    return 'A'
  elif num>=80:
    return 'A+'
  elif num>=70:
    return 'A'
  elif num>=60:
    return 'B+'
  elif num>=50:
    return 'B'
  else:
    return 'F'
#print(df.iloc[1,2:].dtype)
print(df['Mathematics'])
for sub in ['Mathematics', 'Science', 'History', 'English', 'Geography']:
    df[sub+' Grade']=df[sub].apply(grade)
print(df)

0    85.0
1    78.0
2    92.0
3    88.0
4    76.0
5     0.0
6    81.0
Name: Mathematics, dtype: float64
   StudentID     Name  Mathematics  Science  History  English  Geography  \
0        101    Alice         85.0     89.0       90     88.0       92.0   
1        102      Bob         78.0     74.0       85      0.0       81.0   
2        103  Charlie         92.0     95.0       87     90.0        0.0   
3        104    Diana         88.0     91.0       93     85.0       89.0   
4        105    Ethan         76.0     80.0       82     78.0       85.0   
5        106    Fiona          0.0     88.0       84     82.0       86.0   
6        107   George         81.0      0.0       89     77.0       83.0   

  Mathematics Grade Science Grade History Grade English Grade Geography Grade  
0                A+            A+             A            A+               A  
1                 A             A            A+             F              A+  
2                 A             A            A+             A               F
3                A+             A             A            A+              A+
4                 A            A+            A+             A              A+
5                 F            A+            A+            A+              A+
6                A+             F            A+             A              A+

In [70]:
df.iloc[1,:]

StudentID             102
Name                  Bob
Mathematics          78.0
Science              74.0
History                85
English               0.0
Geography            81.0
Mathematics Grade       A
Science Grade           A
History Grade          A+
English Grade           F
Geography Grade        A+
Name: 1, dtype: object


In [73]:
df.describe()

       StudentID  Mathematics    Science    History    English  Geography
count        7.0          7.0        7.0        7.0        7.0        7.0
mean       104.0    71.428571  73.857143  87.142857  71.428571  73.714286
std     2.160247    31.988837  33.323808   3.804759  31.863548  32.709399
min        101.0          0.0        0.0       82.0        0.0        0.0
25%        102.5         77.0       77.0       84.5       77.5       82.0
50%        104.0         81.0       88.0       87.0       82.0       85.0
75%        105.5         86.5       90.0       89.5       86.5       87.5
max        107.0         92.0       95.0       93.0       90.0       92.0


In [74]:
df['Mathematics'].describe()

count          7.0
mean     71.428571
std      31.988837
min            0.0
25%           77.0
50%           81.0
75%           86.5
max           92.0
Name: Mathematics, dtype: float64


## Column selection
-