**Pandas** is a popular open-source Python library for data manipulation and analysis. Its data structures and functions make it efficient to load, clean, and process structured (tabular) data.

In [1]:
import pandas as pd
import numpy as np

A **DataFrame** is a 2-dimensional data structure that stores data in columns, where each column can hold a different type (strings, integers, floating point values, categorical data and more). It is similar to a spreadsheet in Excel.

In [2]:
dat = pd.DataFrame()
dat

In [3]:
dfd = pd.DataFrame(
    {
        "Name": ["Braund", "Harris","Allen"],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)

In [4]:
dfd

Unnamed: 0,Name,Age,Sex
0,Braund,22,male
1,Harris,35,male
2,Allen,58,female
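Since each column carries its own type, it can help to check `dtypes` right after building a frame. The sketch below uses a small hypothetical frame (`people`, with a made-up `Fare` column that is not part of this notebook's data) and is only meant to illustrate the idea.

```python
import pandas as pd

# Hypothetical example frame; the "Fare" values are made up for illustration.
people = pd.DataFrame(
    {
        "Name": ["Braund", "Harris", "Allen"],
        "Age": [22, 35, 58],
        "Fare": [7.25, 71.28, 8.05],
    }
)

# Each column has its own dtype: text is stored differently from numbers.
print(people.dtypes)

# select_dtypes picks columns by type; here only the numeric ones are kept.
numeric_cols = people.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

Knowing which columns are numeric is useful later on, because methods such as `describe()` and `corr()` work on numeric columns.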


**Accessing column values**

In [5]:
dfd["Age"]

0    22
1    35
2    58
Name: Age, dtype: int64

In [6]:
dfd["Name"]

0    Braund
1    Harris
2     Allen
Name: Name, dtype: object

Each column in a DataFrame is a **Series**.<br>
A Series can also be created directly:

In [7]:
fruits = pd.Series(["Peas","Berries","Oranges"], name="Fruits")

fruits

0       Peas
1    Berries
2    Oranges
Name: Fruits, dtype: object
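The relationship between a DataFrame and a Series can be checked directly. The sketch below rebuilds the small frame from the earlier cells so that it runs on its own; the names `age_series`, `age_frame` and `fruits_frame` are chosen here for illustration.

```python
import pandas as pd

# Same small frame as in the cells above, rebuilt so the snippet runs alone.
dfd = pd.DataFrame(
    {
        "Name": ["Braund", "Harris", "Allen"],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)

age_series = dfd["Age"]    # a single column name returns a Series
age_frame = dfd[["Age"]]   # a list of column names returns a DataFrame

print(type(age_series).__name__)
print(type(age_frame).__name__)

# A named Series can be turned into a one-column DataFrame.
fruits = pd.Series(["Peas", "Berries", "Oranges"], name="Fruits")
fruits_frame = fruits.to_frame()
print(fruits_frame.columns.tolist())
```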

In [8]:
dfd["Age"].max()

58

In [9]:
dfd.describe()

Unnamed: 0,Age
count,3.0
mean,38.333333
std,18.230012
min,22.0
25%,28.5
50%,35.0
75%,46.5
max,58.0


**Create a pandas series from a dictionary showing days and dates**

In [10]:
data = {
    'Monday': '2023-07-03',
    'Tuesday': '2023-07-04',
    'Wednesday': '2023-07-05'
}

series = pd.Series(data)
print(series)

Monday       2023-07-03
Tuesday      2023-07-04
Wednesday    2023-07-05
dtype: object


In [11]:
# Accessing 'Monday' and 'Tuesday' using loc
new_series = series.loc[['Monday', 'Tuesday']]
print(new_series)

Monday     2023-07-03
Tuesday    2023-07-04
dtype: object
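The dates above are stored as plain strings (`dtype: object`). As a possible next step, the sketch below contrasts label-based access (`loc`) with position-based access (`iloc`) and converts the strings to real dates with `pd.to_datetime`. The variable names are chosen here for illustration.

```python
import pandas as pd

# Same dictionary as in the cell above.
series = pd.Series(
    {
        "Monday": "2023-07-03",
        "Tuesday": "2023-07-04",
        "Wednesday": "2023-07-05",
    }
)

by_label = series.loc["Tuesday"]   # look up by index label
by_position = series.iloc[1]       # look up by integer position
print(by_label == by_position)

# Convert the strings to datetime values, then ask each date for its weekday.
dates = pd.to_datetime(series)
day_names = dates.dt.day_name()
print(day_names)
```

Once the values are datetimes, the `.dt` accessor gives access to date parts such as the weekday name, month or year.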


**ANALYZING AND CLEANING DATA EXERCISE**

In [12]:
df = pd.read_csv('datasets/Work.csv')
df

Unnamed: 0,Name,city,age,py-score
0,Emma,Kampala,23,90
1,Wilber,Mbale,26,75
2,Robin,Gulu,25,mine
3,Tevor,Livingstone,,89
4,Yeko,Tororo,20,94
5,Miriam,Arua,Train,
6,Jesca,Mbarara,21,84


In [13]:
# Understanding the data structure
# Checking the shape of the data (rows, columns)
df.shape

(7, 4)

In [14]:
df.head()

Unnamed: 0,Name,city,age,py-score
0,Emma,Kampala,23.0,90
1,Wilber,Mbale,26.0,75
2,Robin,Gulu,25.0,mine
3,Tevor,Livingstone,,89
4,Yeko,Tororo,20.0,94


In [15]:
df.head(2)

Unnamed: 0,Name,city,age,py-score
0,Emma,Kampala,23,90
1,Wilber,Mbale,26,75


In [16]:
df.tail()

Unnamed: 0,Name,city,age,py-score
2,Robin,Gulu,25,mine
3,Tevor,Livingstone,,89
4,Yeko,Tororo,20,94
5,Miriam,Arua,Train,
6,Jesca,Mbarara,21,84


In [17]:
df.tail(1)

Unnamed: 0,Name,city,age,py-score
6,Jesca,Mbarara,21,84


In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Name      7 non-null      object
 1   city      7 non-null      object
 2   age       6 non-null      object
 3   py-score  6 non-null      object
dtypes: object(4)
memory usage: 352.0+ bytes


In [19]:
# Checking for duplicates
df.duplicated().sum()

0

In [20]:
df.describe()

Unnamed: 0,Name,city,age,py-score
count,7,7,6,6
unique,7,7,6,6
top,Emma,Kampala,23,90
freq,1,1,1,1


In [21]:
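# Handling a wrong-format value: the string 'mine' in a numerical column.
# The result is assigned back to the column; calling replace(..., inplace=True)
# on a column selection is deprecated in recent pandas versions.
df['py-score'] = df['py-score'].replace('mine', np.nan)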

In [22]:
df

Unnamed: 0,Name,city,age,py-score
0,Emma,Kampala,23,90.0
1,Wilber,Mbale,26,75.0
2,Robin,Gulu,25,
3,Tevor,Livingstone,,89.0
4,Yeko,Tororo,20,94.0
5,Miriam,Arua,Train,
6,Jesca,Mbarara,21,84.0
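Replacing one known bad value works when the bad value is known in advance. A more general option, sketched below, is `pd.to_numeric` with `errors="coerce"`, which turns every non-numeric entry into `NaN`. The frame `work` is a hand-typed copy of the table above (so the snippet does not need the CSV file) and assumes the values were read as strings.

```python
import pandas as pd

# Hand-typed copy of the table above; values are assumed to be strings.
work = pd.DataFrame(
    {
        "Name": ["Emma", "Wilber", "Robin", "Tevor", "Yeko", "Miriam", "Jesca"],
        "age": ["23", "26", "25", None, "20", "Train", "21"],
        "py-score": ["90", "75", "mine", "89", "94", None, "84"],
    }
)

# errors="coerce" converts anything that is not a number into NaN.
for col in ["age", "py-score"]:
    work[col] = pd.to_numeric(work[col], errors="coerce")

print(work.dtypes)
print(work.isna().sum())
```

With this approach the `'Train'` entry in `age` is also treated as missing, and both columns end up with a numeric dtype.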


In [23]:
# Drop rows with any missing values
df1 = df.dropna()

In [24]:
df1

Unnamed: 0,Name,city,age,py-score
0,Emma,Kampala,23,90
1,Wilber,Mbale,26,75
4,Yeko,Tororo,20,94
6,Jesca,Mbarara,21,84
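`dropna()` with no arguments removes a row as soon as one value is missing, which can discard a lot of data. The sketch below shows two alternatives, `subset` and `how`, on a small illustrative frame (`scores`, a simplified hand-made sample, not the full dataset).

```python
import numpy as np
import pandas as pd

# Simplified, hand-made sample for illustration only.
scores = pd.DataFrame(
    {
        "Name": ["Emma", "Robin", "Tevor", "Miriam"],
        "age": [23.0, 25.0, np.nan, np.nan],
        "py-score": [90.0, np.nan, 89.0, np.nan],
    }
)

# Default: drop a row if ANY column is missing.
any_missing = scores.dropna()

# Only consider the py-score column when deciding what to drop.
score_only = scores.dropna(subset=["py-score"])

# Drop a row only if BOTH age and py-score are missing.
both_missing = scores.dropna(how="all", subset=["age", "py-score"])

print(len(any_missing), len(score_only), len(both_missing))
```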


In [25]:
# Performing correlation on the numerical columns.
# The columns were read as text (dtype object), so they are cast to float first.
specific_columns = ['age', 'py-score']
df1[specific_columns].astype(float).corr()

Unnamed: 0,age,py-score
age,1.0,-0.815891
py-score,-0.815891,1.0


**SECOND DATASET.**

In [26]:
df2 = pd.read_csv('datasets/mine.csv')
df2

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409.1
1,60,117,145,479.0
2,60,103,135,340.0
3,45,109,175,282.4
4,45,117,148,406.0
...,...,...,...,...
164,60,105,140,290.8
165,60,110,145,300.0
166,60,115,145,310.2
167,75,120,150,320.4


In [27]:
# Understanding the data structure
# Checking the shape of the data (rows, columns)
df2.shape

(169, 4)

In [28]:
df2.head()

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409.1
1,60,117,145,479.0
2,60,103,135,340.0
3,45,109,175,282.4
4,45,117,148,406.0


In [29]:
df2.tail()

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
164,60,105,140,290.8
165,60,110,145,300.0
166,60,115,145,310.2
167,75,120,150,320.4
168,75,125,150,330.4


In [30]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  169 non-null    int64  
 1   Pulse     169 non-null    int64  
 2   Maxpulse  169 non-null    int64  
 3   Calories  164 non-null    float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB


In [31]:
df2.describe()

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
count,169.0,169.0,169.0,164.0
mean,63.846154,107.461538,134.047337,375.790244
std,42.299949,14.510259,16.450434,266.379919
min,15.0,80.0,100.0,50.3
25%,45.0,100.0,124.0,250.925
50%,60.0,105.0,131.0,318.6
75%,60.0,111.0,141.0,387.6
max,300.0,159.0,184.0,1860.4


In [32]:
# Filling the missing Calories values with the column median using fillna().
df2['Calories'] = df2['Calories'].fillna(df2['Calories'].median())

In [33]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  169 non-null    int64  
 1   Pulse     169 non-null    int64  
 2   Maxpulse  169 non-null    int64  
 3   Calories  169 non-null    float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB
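One reason to prefer the median over the mean when filling gaps is that the median is less affected by extreme values (in `describe()` above, the maximum of `Calories` is far above the 75% quartile). The sketch below uses made-up numbers, not values from `mine.csv`, to show the difference.

```python
import numpy as np
import pandas as pd

# Made-up values with one large outlier on purpose.
calories = pd.Series([300.0, 320.0, np.nan, 340.0, 1860.0], name="Calories")

# Both statistics skip NaN by default.
mean_value = calories.mean()      # pulled upwards by the outlier
median_value = calories.median()  # middle of the sorted values

print(mean_value, median_value)

# Fill the gap with the median, leaving the other values untouched.
filled = calories.fillna(median_value)
print(filled)
```

Here the mean is 705.0 while the median is 330.0, so the median gives a fill value closer to a typical row.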


In [34]:
df2.dtypes

Duration      int64
Pulse         int64
Maxpulse      int64
Calories    float64
dtype: object

In [35]:
# Checking for duplicates
df2.duplicated().sum()

7

In [36]:
# Removing duplicate rows, keeping the first occurrence of each
df2 = df2.drop_duplicates(keep='first')

In [37]:
# Show that duplicates are removed
df2.duplicated().sum()

0
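To see what the `keep` argument changes, the sketch below runs `duplicated()` on a small made-up frame (`workouts`, not taken from `mine.csv`). It also resets the index after dropping rows, which is optional but avoids gaps in the row labels.

```python
import pandas as pd

# Made-up rows: the first, second and fourth rows are identical.
workouts = pd.DataFrame(
    {
        "Duration": [60, 60, 45, 60],
        "Pulse": [110, 110, 109, 110],
    }
)

# keep="first" marks only the later repeats as duplicates.
first_kept = workouts.duplicated(keep="first")

# keep=False marks every copy, including the first one.
all_marked = workouts.duplicated(keep=False)

print(first_kept.sum(), all_marked.sum())

# Drop the repeats and renumber the rows from 0.
deduped = workouts.drop_duplicates(keep="first").reset_index(drop=True)
print(deduped)
```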

In [38]:
# Show the correlation
df2.corr()

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
Duration,1.0,-0.162098,0.003578,0.922762
Pulse,-0.162098,1.0,0.787035,0.018697
Maxpulse,0.003578,0.787035,1.0,0.196973
Calories,0.922762,0.018697,0.196973,1.0
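Values close to 1 or -1 in the correlation matrix indicate a strong linear relationship, and values near 0 indicate a weak one. The sketch below shows two ways to read a single coefficient, using a toy frame whose columns are exactly linear in each other (made-up numbers, with a hypothetical `Rest` column that is not in the dataset).

```python
import pandas as pd

# Toy data: Calories is exactly 5 * Duration, Rest is exactly 105 - Duration.
toy = pd.DataFrame(
    {
        "Duration": [30, 45, 60, 75],
        "Calories": [150.0, 225.0, 300.0, 375.0],
        "Rest": [75, 60, 45, 30],
    }
)

# Full matrix, then one cell picked out by row and column label.
corr = toy.corr()
dur_cal = corr.loc["Duration", "Calories"]

# The same kind of coefficient computed directly between two columns.
dur_rest = toy["Duration"].corr(toy["Rest"])

print(round(dur_cal, 3), round(dur_rest, 3))
```

A perfectly increasing relationship gives a coefficient of 1 and a perfectly decreasing one gives -1; real data, such as the table above, falls somewhere in between.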
