# Pandas

### What is Pandas?
Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

### Why Use Pandas?
Pandas allows us to analyze large data sets and draw conclusions based on statistical theory.

Pandas can clean messy data sets and make them readable and relevant.

Relevant data is very important in data science.

### What Can Pandas Do?
Pandas gives you answers about the data, such as:

- Is there a correlation between two or more columns?
- What is the average value?
- What is the maximum value?
- What is the minimum value?

Pandas can also delete rows that are not relevant or that contain wrong values, such as empty or NULL values. This is called cleaning the data.
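
As a rough illustration of these questions, the sketch below computes each answer on a small made-up table. The column names and values are assumptions for the example, not data from this notebook.

```python
import pandas as pd

# Hypothetical sample data (assumed for illustration only)
df = pd.DataFrame({
    "duration": [30, 45, 60, 45],
    "calories": [200.0, 300.0, 400.0, None],  # one empty cell
})

mean_duration = df["duration"].mean()   # average value
max_duration = df["duration"].max()     # largest value
min_duration = df["duration"].min()     # smallest value

# Correlation between two columns; rows with a missing value are skipped
correlation = df["duration"].corr(df["calories"])

# Cleaning: drop every row that contains an empty cell
cleaned = df.dropna()

print(mean_duration, max_duration, min_duration)
print(round(correlation, 2))
print(len(cleaned))
```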

In [1]:
import pandas as pd

mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}

myvar = pd.DataFrame(mydataset)

print(myvar)

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


In [2]:
import pandas as pd

print(pd.__version__)

2.2.2


### Pandas Series

A Pandas Series is like a column in a table.

It is a one-dimensional array holding data of any type.

In [3]:
import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar)

0    1
1    7
2    2
dtype: int64


#### Labels
If nothing else is specified, the values are labeled with their index number: the first value has index 0, the second value has index 1, and so on.

This label can be used to access a specified value.
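
For instance, with the default labels a value can be read by its index number. This is a minimal sketch that reuses the values from the Series example above.

```python
import pandas as pd

myvar = pd.Series([1, 7, 2])

# With the default labels, 0 refers to the first value
first_value = myvar[0]
print(first_value)
```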

#### Create Labels

In [4]:
import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a, index = ["x", "y", "z"])

print(myvar)

x    1
y    7
z    2
dtype: int64
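
With custom labels in place, a value can be looked up by its label instead of its position. A minimal sketch using the same values:

```python
import pandas as pd

myvar = pd.Series([1, 7, 2], index=["x", "y", "z"])

# Look up the value stored under the label "y"
value_y = myvar["y"]
print(value_y)
```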


#### Key/Value Objects as Series
You can also use a key/value object, like a dictionary, when creating a Series.

In [5]:
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories)

print(myvar)

day1    420
day2    380
day3    390
dtype: int64


In [6]:
# Create a Series using only data from "day1" and "day2":

import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories, index = ["day1", "day2"])

print(myvar)

day1    420
day2    380
dtype: int64


### DataFrames
A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.

In [7]:
import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

#load data into a DataFrame object:
df = pd.DataFrame(data)

print(df)

   calories  duration
0       420        50
1       380        40
2       390        45


#### Locate Row
As you can see from the result above, the DataFrame is like a table with rows and columns.

Pandas uses the *loc* attribute to return one or more specified rows.

In [8]:
#refer to the row index:
print(df.loc[0])

calories    420
duration     50
Name: 0, dtype: int64


In [9]:
#use a list of indexes:
print(df.loc[[0, 1]])

   calories  duration
0       420        50
1       380        40


#### Named Indexes
With the index argument, you can name your own indexes.

In [10]:
import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

print(df)

      calories  duration
day1       420        50
day2       380        40
day3       390        45


In [11]:
# prompt: show in ascending order of their calories

import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

# Sort the rows by the "calories" column, smallest value first
df_sorted = df.sort_values(by="calories", ascending=True)

df_sorted

      calories  duration
day2       380        40
day3       390        45
day1       420        50


In [12]:
# prompt: append to df from input

import pandas as pd

# Assuming you have an existing DataFrame called 'df'
# Example DataFrame (replace with your actual DataFrame)
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

# Get input from the user
new_calories = int(input("Enter calories: "))
new_duration = int(input("Enter duration: "))
new_index = input("Enter index: ")

# Create a new row as a dictionary
new_row = {'calories': new_calories, 'duration': new_duration}


# Create a new DataFrame from the new row data
new_df = pd.DataFrame([new_row], index=[new_index])

# Concatenate the new DataFrame with the existing DataFrame
df = pd.concat([df, new_df])

# Display the updated DataFrame
df

Enter calories: 33
Enter duration: 3
Enter index: 1


      calories  duration
day1       420        50
day2       380        40
day3       390        45
1           33         3


In [13]:
rows = []
n = int(input("Enter the number of semesters: "))
for i in range(n):
  new_semester = input("Enter the semester: ")
  new_SGPA = float(input("Enter the SGPA: "))
  # Collect the rows and build the DataFrame once, instead of concatenating onto an empty DataFrame
  rows.append({"semester": new_semester, "SGPA": new_SGPA})
df = pd.DataFrame(rows, columns=["semester", "SGPA"])
print(df)

Enter the number of semesters: 1
Enter the semester: 34
Enter the SGPA: 4
  semester  SGPA
0       34   4.0


#### Locate a Field
Use the *iloc* attribute with a row position and a column position to return a single value.

In [14]:
print(df.iloc[0,1])

4.0
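
The sketch below contrasts position-based access with *iloc* and label-based access with *loc*. It rebuilds the calories/duration DataFrame from the earlier cells so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame(
    {"calories": [420, 380, 390], "duration": [50, 40, 45]},
    index=["day1", "day2", "day3"],
)

# iloc takes integer positions: first row, second column
by_position = df.iloc[0, 1]

# loc takes labels: row "day1", column "duration"
by_label = df.loc["day1", "duration"]

print(by_position, by_label)
```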


#### Locate Named Indexes

In [16]:
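#recreate the DataFrame with named indexes (the previous cells replaced df):
df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]}, index = ["day1", "day2", "day3"])

#refer to the named index:
print(df.loc["day2"])

calories    380
duration     40
Name: day2, dtype: int64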

#### Load Files Into a DataFrame
If your data sets are stored in a file, Pandas can load them into a DataFrame.

In [None]:
df = pd.read_csv('dataset.csv')

print(df)
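
To try read_csv without a file on disk, the text can be wrapped in io.StringIO, which read_csv accepts in place of a file name. The column names and values below are made up for illustration and are not the contents of dataset.csv.

```python
import io
import pandas as pd

# Hypothetical CSV content (assumed for illustration only)
csv_text = """Duration,Pulse,Calories
60,110,409.1
45,109,282.4
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df)
```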

#### Read CSV Files

In [None]:
df = pd.read_csv('data.csv')

print(df.to_string())

In [None]:
print(df)
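
The difference between the two print calls shows up on larger tables: print(df) abbreviates a DataFrame that has more rows than the display.max_rows option allows, while to_string() always renders every row. The sketch below uses a made-up 100-row DataFrame and sets the option explicitly (60 is the usual default) so the behaviour does not depend on local settings.

```python
import pandas as pd

df = pd.DataFrame({"n": range(100)})  # hypothetical data

with pd.option_context("display.max_rows", 60):
    short_text = str(df)       # what print(df) would show: abbreviated
full_text = df.to_string()     # every row, regardless of display options

print(len(short_text.splitlines()), len(full_text.splitlines()))
```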

### Pandas - Analyzing DataFrames

In [None]:
df = pd.read_csv('data.csv')

print(df.head(10))

In [None]:
import pandas as pd
df = pd.read_csv('/content/household_power_consumption.txt', sep=';', low_memory=False)
print(df.head())

In [None]:
# Print the first 5 rows of the DataFrame:
print(df.head())

In [None]:
# Print the last 5 rows of the DataFrame:

print(df.tail())

In [None]:
# Print information about the data:

print(df.info())

### Data Cleaning
Data cleaning means fixing bad data in your data set.

Bad data could be:

- Empty cells
- Data in wrong format
- Wrong data
- Duplicates

#### Empty Cells

One way to deal with empty cells is to remove rows that contain empty cells.



In [None]:
# Return a new DataFrame with no empty cells:
df = pd.read_csv('data.csv')

new_df = df.dropna()

print(new_df.to_string())

In [None]:
# Remove all rows with NULL values:

df = pd.read_csv('data.csv')

df.dropna(inplace = True)

print(df.to_string())
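
The difference between the two cells above is whether the original DataFrame changes. A small sketch with made-up values (the column names are assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 45, None],
    "Calories": [409.1, None, 300.0],
})

new_df = df.dropna()          # returns a new DataFrame; df keeps all 3 rows
rows_before = len(df)

df.dropna(inplace=True)       # changes df itself and returns None
rows_after = len(df)

print(len(new_df), rows_before, rows_after)
```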

In [None]:
# Replace NULL values with the number 130:

df = pd.read_csv('data.csv')

df.fillna(130, inplace = True)
print(df.to_string())

In [None]:
# Replace NULL values in the "Calories" column with the number 130:

df = pd.read_csv('data.csv')

# Assign the result back; calling fillna(inplace = True) on a selected column may not update df
df["Calories"] = df["Calories"].fillna(130)
print(df.to_string())

#### Replace Using Mean, Median, or Mode
A common way to replace empty cells is to use the mean, median, or mode value of the column.

Pandas provides the mean(), median(), and mode() methods to calculate the respective values for a specified column:

In [None]:
df = pd.read_csv('data.csv')

x = df["Calories"].mean()

df["Calories"] = df["Calories"].fillna(x)

In [None]:
df = pd.read_csv('data.csv')

x = df["Calories"].median()

df["Calories"] = df["Calories"].fillna(x)
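
The cells above cover the mean and the median. For the mode, note that mode() returns a Series, because several values can be tied for most frequent, so one value has to be picked from it. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"Calories": [300.0, 300.0, 250.0, None]})

# mode() ignores empty cells and returns a Series; take its first entry
x = df["Calories"].mode()[0]

df["Calories"] = df["Calories"].fillna(x)
print(df)
```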

#### Data of Wrong Format
Cells with data in the wrong format can make it difficult, or even impossible, to analyze data.

To fix it, you have two options: remove the rows, or convert all cells in the columns into the same format.

In [None]:
# Convert Into a Correct Format
df = pd.read_csv('data.csv')

df['Date'] = pd.to_datetime(df['Date'])

print(df.to_string())
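
Both options can be combined. In this sketch, values that cannot be parsed are turned into NaT (the pandas marker for a missing date) by passing errors="coerce", and those rows are then removed. The sample values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2020/12/01", "not a date", "2020/12/03"],
    "Pulse": [110, 117, 103],
})

# Unparseable entries become NaT instead of raising an error
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")

# Remove the rows whose date could not be converted
df = df.dropna(subset=["Date"])
print(df)
```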

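#### Duplicates
Duplicate rows are rows that have been registered more than once. The duplicated() method returns True for every row that repeats an earlier row, and False otherwise.

In [None]:
df = pd.read_csv('data.csv')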

In [None]:
df

In [None]:
print(df.duplicated())
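
Once duplicates are identified, drop_duplicates() removes them. A minimal sketch on made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 60, 45],
    "Pulse": [110, 110, 109],
})

flags = df.duplicated()          # True marks a row that repeats an earlier one
deduped = df.drop_duplicates()   # keeps the first occurrence of each row

print(flags.tolist())
print(len(deduped))
```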

#### Finding Relationships

In [None]:
# numeric_only = True skips non-numeric columns such as "Date"
df.corr(numeric_only = True)
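
corr() returns a table with one row and one column per numeric column, where each cell holds the correlation between that pair. Values range from -1 to 1, and a column always correlates perfectly with itself. A sketch on made-up numbers:

```python
import pandas as pd

# Hypothetical data (assumed for illustration only)
df = pd.DataFrame({
    "Duration": [30, 45, 60, 90],
    "Calories": [200, 310, 390, 620],
    "Note": ["a", "b", "c", "d"],   # non-numeric column
})

# numeric_only=True leaves the text column out of the calculation
matrix = df.corr(numeric_only=True)
print(matrix)
```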

### Pandas - Plotting

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')

df.plot()

plt.show()

In [None]:
df.plot(kind = 'scatter', x = 'Duration', y = 'Calories')

plt.show()

In [None]:
df.plot(kind = 'scatter', x = 'Duration', y = 'Maxpulse')

plt.show()

In [None]:
df["Duration"].plot(kind = 'hist')