# Introduction to Pandas
🦊 `Notebook by` [Md.Samiul Alim](https://github.com/sami0055)

😋  `Machine Learning Source Codes` [GitHub](https://github.com/sami0055/Machine-Learning)

# What is Pandas?
Pandas is a Python library for working with data sets.
It provides functions for analyzing, cleaning, exploring, and manipulating data.
The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.



# Why Use Pandas?

Pandas offers several benefits that make it a preferred choice for data manipulation and analysis:



1. **Data Structures:** Pandas introduces two key data structures, Series (1-dimensional labeled array) and DataFrame (2-dimensional labeled data structure), which are flexible and powerful for handling data.

2. **Ease of Data Handling:** It simplifies common data manipulation tasks like indexing, filtering, reshaping, aggregating, and cleaning data, making it efficient and straightforward.

3. **Integration with Other Libraries:** Pandas integrates well with other Python libraries used in data science, such as NumPy, Matplotlib, and scikit-learn, allowing seamless data transformation and analysis within these environments.

4. **Handling Missing Data:** Pandas provides functionalities to handle missing or incomplete data, making it easier to clean and preprocess datasets without compromising the analysis.

5. **Input/Output Tools:** Pandas supports reading and writing data from various file formats like CSV, Excel, SQL databases, JSON, and more, making it easy to work with different data sources.

6. **Performance:** While there might be trade-offs between speed and convenience, Pandas is generally optimized for performance when working with medium-sized datasets. For larger datasets, developers often combine Pandas with other libraries like Dask for distributed computing.

# What Can Pandas Do?

* Data Loading
* Data Exploration
* Data Cleaning
* Data Manipulation
* Feature Engineering
* Data Preprocessing
* Integration with ML Libraries
* Is there a correlation between two or more columns?
* What is average value?
* Max value?
* Min value?

Pandas can also delete rows that are irrelevant or that contain wrong values, such as empty or NULL values. This is called cleaning the data.
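
The questions in the list above each map onto a short call. The following is a minimal sketch, assuming a small made-up table; the column names `hours` and `score` and all values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical data, invented purely for illustration
df = pd.DataFrame({
    "hours": [1.0, 2.0, 3.0, 4.0, None],
    "score": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Cleaning: drop rows that contain empty (NaN) values
clean = df.dropna()

# Correlation between two columns (Pearson by default)
corr = clean["hours"].corr(clean["score"])

# Average, max, and min of a column
avg_score = clean["score"].mean()
max_score = clean["score"].max()
min_score = clean["score"].min()

print(clean)
print(corr, avg_score, max_score, min_score)
```

Here `dropna()` returns a new DataFrame, so the original `df` still contains the incomplete row.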


## Installation of Pandas
Make sure that Python is already installed.

Install it using command line: `pip install pandas`

Install in notebook: `!pip install pandas`



## Import Pandas
Once Pandas is installed, import it with the `import` keyword: `import pandas`. By convention, it is imported under the alias `pd`.

In [138]:
import pandas as pd

## Pandas Series

In Pandas, a Series is a one-dimensional labeled array capable of holding data of any type (integer, float, string, Python objects, etc.). It's similar to a Python list or a one-dimensional NumPy array but provides additional features.

In [139]:
#Creating a series from a List
data=[10,20,30,40,50]
series=pd.Series(data)
print(series)

0    10
1    20
2    30
3    40
4    50
dtype: int64


In [140]:
#Access each value of the series with a for loop
for value in series:
    print(value)

10
20
30
40
50


In [141]:
type(series)

pandas.core.series.Series

In [142]:
#Custom Indexing in series
custom_index=['A','B','C','D','E']
custom_Series=pd.Series(data,index=custom_index)
print(custom_Series)

A    10
B    20
C    30
D    40
E    50
dtype: int64


In [143]:
#Accessing the item with index 'A'
print(custom_Series['A'])

10
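
Two of the additional features a Series offers over a plain list are vectorized arithmetic and alignment by label. The sketch below assumes two small hypothetical Series (`s1` and `s2` are names made up for this example).

```python
import pandas as pd

# Two hypothetical Series with partially overlapping labels
s1 = pd.Series([10, 20, 30], index=["A", "B", "C"])
s2 = pd.Series([1, 2, 3], index=["B", "C", "D"])

# Vectorized arithmetic: applied to every element without a loop
doubled = s1 * 2

# Label alignment: values are matched by index label, not by position.
# Labels present in only one Series produce NaN.
total = s1 + s2

print(doubled)
print(total)
```

Because `A` and `D` each appear in only one of the two Series, their sums are missing values, and the result is stored as floats.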


## Pandas DataFrames

Pandas DataFrames are two-dimensional, size-mutable, and potentially heterogeneous tabular data structures with labeled axes (rows and columns). They resemble a spreadsheet or SQL table, and they consist of rows and columns, where each column can hold different types of data.

In [144]:
import pandas as pd
#Creating a DataFrame from a Dictionary
data={
    'Name':['Alice','Bob','Sami'],
    'Age':[12,14,22],
    'City':['NY','SF','Dhaka']
}
df=pd.DataFrame(data)
print(df)

    Name  Age   City
0  Alice   12     NY
1    Bob   14     SF
2   Sami   22  Dhaka


In [145]:
#Apply Custom index on df
custom_index=['ID1','ID2','ID3']
df_custom=pd.DataFrame(data,index=custom_index)
df_custom

Unnamed: 0,Name,Age,City
ID1,Alice,12,NY
ID2,Bob,14,SF
ID3,Sami,22,Dhaka


In [146]:
#Selecting specific columns by name
df[['Name']]

Unnamed: 0,Name
0,Alice
1,Bob
2,Sami


In [147]:
df[['Name','Age']]

Unnamed: 0,Name,Age
0,Alice,12
1,Bob,14
2,Sami,22


# loc and iloc

In [148]:
# importing the module
import pandas as pd
 
# creating a sample dataframe
data = pd.DataFrame({'Brand': ['Maruti', 'Hyundai', 'Tata',
                               'Mahindra', 'Maruti', 'Hyundai',
                               'Renault', 'Tata', 'Maruti'],
                     'Year': [2012, 2014, 2011, 2015, 2012,
                              2016, 2014, 2018, 2019],
                     'Kms Driven': [50000, 30000, 60000,
                                    25000, 10000, 46000,
                                    31000, 15000, 12000],
                     'City': ['Gurgaon', 'Delhi', 'Mumbai',
                              'Delhi', 'Mumbai', 'Delhi',
                              'Mumbai', 'Chennai',  'Ghaziabad'],
                     'Mileage':  [28, 27, 25, 26, 28,
                                  29, 24, 21, 24]})
 

data

Unnamed: 0,Brand,Year,Kms Driven,City,Mileage
0,Maruti,2012,50000,Gurgaon,28
1,Hyundai,2014,30000,Delhi,27
2,Tata,2011,60000,Mumbai,25
3,Mahindra,2015,25000,Delhi,26
4,Maruti,2012,10000,Mumbai,28
5,Hyundai,2016,46000,Delhi,29
6,Renault,2014,31000,Mumbai,24
7,Tata,2018,15000,Chennai,21
8,Maruti,2019,12000,Ghaziabad,24


### Example 1: Selecting Data According to Some Conditions
In this example, the code uses the `loc` indexer to select and display the rows of the DataFrame where the brand is 'Maruti' and the mileage is greater than 25.

In [149]:

#Select rows where Brand is 'Maruti' and Mileage is greater than 25
data2=data.loc[(data.Brand=='Maruti') & (data.Mileage>25)]
print(data2)

    Brand  Year  Kms Driven     City  Mileage
0  Maruti  2012       50000  Gurgaon       28
4  Maruti  2012       10000   Mumbai       28


### Example 2: Selecting a Range of Rows From the DataFrame
In this example, the code uses `loc` to extract and display the rows with index labels 2 through 5 from the DataFrame. Unlike ordinary Python slicing, a `loc` slice includes the end label.

In [150]:
display(data.loc[2:5])

Unnamed: 0,Brand,Year,Kms Driven,City,Mileage
2,Tata,2011,60000,Mumbai,25
3,Mahindra,2015,25000,Delhi,26
4,Maruti,2012,10000,Mumbai,28
5,Hyundai,2016,46000,Delhi,29


### Example 3: Updating the Value of Any Column
In this example, the code uses `loc` to set 'Mileage' to 22 for every car in the DataFrame whose manufacturing year is before 2015. The modified DataFrame is then displayed, reflecting the changes made to the Mileage column.

In [151]:
data.loc[(data.Year<2015),['Mileage']]=22
display(data)

Unnamed: 0,Brand,Year,Kms Driven,City,Mileage
0,Maruti,2012,50000,Gurgaon,22
1,Hyundai,2014,30000,Delhi,22
2,Tata,2011,60000,Mumbai,22
3,Mahindra,2015,25000,Delhi,26
4,Maruti,2012,10000,Mumbai,22
5,Hyundai,2016,46000,Delhi,29
6,Renault,2014,31000,Mumbai,22
7,Tata,2018,15000,Chennai,21
8,Maruti,2019,12000,Ghaziabad,24


## iloc Indexer

### Example 1: Selecting Rows Using Integer Indices
In this example, the code uses the `iloc` indexer to extract and display the rows at integer positions 0, 2, 4, and 5 from the DataFrame.

In [152]:
# importing the module
import pandas as pd
 
# creating a sample dataframe
data = pd.DataFrame({'Brand': ['Maruti', 'Hyundai', 'Tata',
                               'Mahindra', 'Maruti', 'Hyundai',
                               'Renault', 'Tata', 'Maruti'],
                     'Year': [2012, 2014, 2011, 2015, 2012,
                              2016, 2014, 2018, 2019],
                     'Kms Driven': [50000, 30000, 60000,
                                    25000, 10000, 46000,
                                    31000, 15000, 12000],
                     'City': ['Gurgaon', 'Delhi', 'Mumbai',
                              'Delhi', 'Mumbai', 'Delhi',
                              'Mumbai', 'Chennai',  'Ghaziabad'],
                     'Mileage':  [28, 27, 25, 26, 28,
                                  29, 24, 21, 24]})
 

data

Unnamed: 0,Brand,Year,Kms Driven,City,Mileage
0,Maruti,2012,50000,Gurgaon,28
1,Hyundai,2014,30000,Delhi,27
2,Tata,2011,60000,Mumbai,25
3,Mahindra,2015,25000,Delhi,26
4,Maruti,2012,10000,Mumbai,28
5,Hyundai,2016,46000,Delhi,29
6,Renault,2014,31000,Mumbai,24
7,Tata,2018,15000,Chennai,21
8,Maruti,2019,12000,Ghaziabad,24


In [153]:
display(data.iloc[[0,2,4,5]])

Unnamed: 0,Brand,Year,Kms Driven,City,Mileage
0,Maruti,2012,50000,Gurgaon,28
2,Tata,2011,60000,Mumbai,25
4,Maruti,2012,10000,Mumbai,28
5,Hyundai,2016,46000,Delhi,29


In [154]:
#Rows at positions 1 to 4 and columns at positions 2 to 4 (end positions are excluded)
display(data.iloc[1:5,2:5])

Unnamed: 0,Kms Driven,City,Mileage
1,30000,Delhi,27
2,60000,Mumbai,25
3,25000,Delhi,26
4,10000,Mumbai,28
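
The difference between label-based and position-based slicing is easy to miss when the index is the default 0, 1, 2, and so on. The sketch below makes it visible by assuming a hypothetical DataFrame with string labels.

```python
import pandas as pd

# Hypothetical frame with string labels so labels and positions differ
df = pd.DataFrame({"value": [10, 20, 30, 40, 50]},
                  index=["a", "b", "c", "d", "e"])

# loc slices by label and includes the end label
by_label = df.loc["b":"d"]

# iloc slices by position and excludes the end position
by_position = df.iloc[1:3]

print(by_label)     # rows b, c, d
print(by_position)  # rows b, c
```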


In [155]:
#Selecting specific columns by name
data[['Brand','Year']]

Unnamed: 0,Brand,Year
0,Maruti,2012
1,Hyundai,2014
2,Tata,2011
3,Mahindra,2015
4,Maruti,2012
5,Hyundai,2016
6,Renault,2014
7,Tata,2018
8,Maruti,2019


In [156]:
# importing the module
import pandas as pd
 
# creating a sample dataframe
data = pd.DataFrame({'Brand': ['Maruti', 'Hyundai', 'Tata',
                               'Mahindra', 'Maruti', 'Hyundai',
                               'Renault', 'Tata', 'Maruti'],
                     'Year': [2012, 2014, 2011, 2015, 2012,
                              2016, 2014, 2018, 2019],
                     'Kms Driven': [50000, 30000, 60000,
                                    25000, 10000, 46000,
                                    31000, 15000, 12000],
                     'City': ['Gurgaon', 'Delhi', 'Mumbai',
                              'Delhi', 'Mumbai', 'Delhi',
                              'Mumbai', 'Chennai',  'Ghaziabad'],
                     'Mileage':  [28, 27, 25, 26, 28,
                                  29, 24, 21, 24]})
 

data

Unnamed: 0,Brand,Year,Kms Driven,City,Mileage
0,Maruti,2012,50000,Gurgaon,28
1,Hyundai,2014,30000,Delhi,27
2,Tata,2011,60000,Mumbai,25
3,Mahindra,2015,25000,Delhi,26
4,Maruti,2012,10000,Mumbai,28
5,Hyundai,2016,46000,Delhi,29
6,Renault,2014,31000,Mumbai,24
7,Tata,2018,15000,Chennai,21
8,Maruti,2019,12000,Ghaziabad,24


In [160]:
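custom_index = ["ID1", "ID2", "ID3","ID4", "ID5", "ID6",'ID7', "ID8", "ID9"]
# Passing an existing DataFrame together with index= reindexes it; since none of
# the new labels exist in the old index, every cell would become NaN.
# Copy the frame and assign the new labels instead.
df_with_custom_index = data.copy()
df_with_custom_index.index = custom_index
display(df_with_custom_index)

Unnamed: 0,Brand,Year,Kms Driven,City,Mileage
ID1,Maruti,2012,50000,Gurgaon,28
ID2,Hyundai,2014,30000,Delhi,27
ID3,Tata,2011,60000,Mumbai,25
ID4,Mahindra,2015,25000,Delhi,26
ID5,Maruti,2012,10000,Mumbai,28
ID6,Hyundai,2016,46000,Delhi,29
ID7,Renault,2014,31000,Mumbai,24
ID8,Tata,2018,15000,Chennai,21
ID9,Maruti,2019,12000,Ghaziabad,24


In [161]:
df_with_custom_index.loc['ID1':'ID5']

Unnamed: 0,Brand,Year,Kms Driven,City,Mileage
ID1,Maruti,2012,50000,Gurgaon,28
ID2,Hyundai,2014,30000,Delhi,27
ID3,Tata,2011,60000,Mumbai,25
ID4,Mahindra,2015,25000,Delhi,26
ID5,Maruti,2012,10000,Mumbai,28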


### query
The `.query()` method filters rows using a query expression written as a string.

In [162]:
import pandas as pd
#creating a DataFrame from a Dictionary
data={
    'Name':['Alice','Bob','Sami','Sahil','Tahsin','Mushfiq'],
    'Age':[12,13,22,22,23,21],
    'City':['NY','UK','Chittagong','Dhaka','Khulna','Dhaka']
}
df=pd.DataFrame(data)
display(df)

Unnamed: 0,Name,Age,City
0,Alice,12,NY
1,Bob,13,UK
2,Sami,22,Chittagong
3,Sahil,22,Dhaka
4,Tahsin,23,Khulna
5,Mushfiq,21,Dhaka


In [171]:
#Filtering rows using query expression
#filtered=df.query('City=="Dhaka"')
filtered=df.query('Age>20')
filtered

Unnamed: 0,Name,Age,City
2,Sami,22,Chittagong
3,Sahil,22,Dhaka
4,Tahsin,23,Khulna
5,Mushfiq,21,Dhaka
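
A query expression can also combine several conditions and refer to Python variables. The following is a minimal sketch on a shortened, hypothetical version of the table; `min_age` is a name invented for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Sami"],
    "Age": [12, 13, 22],
    "City": ["NY", "UK", "Dhaka"],
})

min_age = 12  # hypothetical threshold

# Combine conditions with `and` / `or`; prefix Python variables with @
result = df.query('Age > @min_age and City != "UK"')
print(result)
```

Only the row for Sami satisfies both conditions here.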


## Data Exploration and Information

### info()

The df.info() method in Pandas provides a concise summary of a DataFrame, including the index dtype and column dtypes, non-null values, and memory usage. It's a handy way to quickly get an overview of the DataFrame's structure and the data it contains.

In [178]:
# importing the module
import pandas as pd
 
# creating a sample dataframe
data = pd.DataFrame({'Brand': ['Maruti', 'Hyundai', 'Tata',
                               'Mahindra', 'Maruti', 'Hyundai',
                               'Renault', 'Tata', 'Maruti'],
                     'Year': [2012, 2014, 2011, 2015, 2012,
                              2016, 2014, 2018, 2019],
                     'Kms Driven': [50000, 30000, 60000,
                                    25000, 10000, 46000,
                                    31000,None, 12000],
                     'City': ['Gurgaon', 'Delhi', 'Mumbai',
                              'Delhi', 'Mumbai', 'Delhi',
                              'Mumbai', 'Chennai',  'Ghaziabad'],
                     'Mileage':  [28, 27, 25, 26, 28,
                                  29, 24, 21, 24]})
 

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Brand       9 non-null      object 
 1   Year        9 non-null      int64  
 2   Kms Driven  8 non-null      float64
 3   City        9 non-null      object 
 4   Mileage     9 non-null      int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 492.0+ bytes
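
When only the number of missing values per column is of interest, `isna()` combined with `sum()` gives it directly. A minimal sketch, assuming a cut-down, hypothetical version of the car table:

```python
import pandas as pd

df = pd.DataFrame({
    "Brand": ["Maruti", "Tata", "Hyundai"],
    "Kms Driven": [50000, None, 46000],
})

# isna() marks each missing cell as True; sum() counts the True values per column
missing_per_column = df.isna().sum()
print(missing_per_column)
```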


### describe()
The describe() method in Pandas generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution. It provides statistical information about numerical columns in a DataFrame.

In [179]:
data.describe()

Unnamed: 0,Year,Kms Driven,Mileage
count,9.0,8.0,9.0
mean,2014.555556,33000.0,25.777778
std,2.74368,17864.569884,2.538591
min,2011.0,10000.0,21.0
25%,2012.0,21750.0,24.0
50%,2014.0,30500.0,26.0
75%,2016.0,47000.0,28.0
max,2019.0,60000.0,29.0
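
By default `describe()` covers only numeric columns, which is why Brand and City are absent above. Passing `include="all"` adds the remaining columns. The sketch below assumes a small hypothetical frame.

```python
import pandas as pd

df = pd.DataFrame({
    "Brand": ["Maruti", "Tata", "Maruti"],
    "Mileage": [28, 25, 24],
})

# include="all" adds count, unique, top, and freq for non-numeric columns
summary = df.describe(include="all")
print(summary)
```

Statistics that do not apply to a column, such as the mean of Brand, are shown as NaN.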


### head() and tail()
The head() and tail() methods in Pandas are used to view a small portion of a DataFrame. They are helpful for quickly examining the beginning or end of a DataFrame to get a sense of its structure or contents.

In [180]:
data.head()

Unnamed: 0,Brand,Year,Kms Driven,City,Mileage
0,Maruti,2012,50000.0,Gurgaon,28
1,Hyundai,2014,30000.0,Delhi,27
2,Tata,2011,60000.0,Mumbai,25
3,Mahindra,2015,25000.0,Delhi,26
4,Maruti,2012,10000.0,Mumbai,28


In [181]:
data.head(6)

Unnamed: 0,Brand,Year,Kms Driven,City,Mileage
0,Maruti,2012,50000.0,Gurgaon,28
1,Hyundai,2014,30000.0,Delhi,27
2,Tata,2011,60000.0,Mumbai,25
3,Mahindra,2015,25000.0,Delhi,26
4,Maruti,2012,10000.0,Mumbai,28
5,Hyundai,2016,46000.0,Delhi,29


In [182]:
data.tail()

Unnamed: 0,Brand,Year,Kms Driven,City,Mileage
4,Maruti,2012,10000.0,Mumbai,28
5,Hyundai,2016,46000.0,Delhi,29
6,Renault,2014,31000.0,Mumbai,24
7,Tata,2018,,Chennai,21
8,Maruti,2019,12000.0,Ghaziabad,24


In [183]:
data.tail(6)

Unnamed: 0,Brand,Year,Kms Driven,City,Mileage
3,Mahindra,2015,25000.0,Delhi,26
4,Maruti,2012,10000.0,Mumbai,28
5,Hyundai,2016,46000.0,Delhi,29
6,Renault,2014,31000.0,Mumbai,24
7,Tata,2018,,Chennai,21
8,Maruti,2019,12000.0,Ghaziabad,24


### value_counts()
The value_counts() method in Pandas is used to count the occurrences of unique values in a column of a DataFrame. It's particularly useful for understanding the distribution of values within a specific column.

In [184]:
data['City'].value_counts()

City
Delhi        3
Mumbai       3
Gurgaon      1
Chennai      1
Ghaziabad    1
Name: count, dtype: int64

In [185]:
data['Brand'].value_counts()

Brand
Maruti      3
Hyundai     2
Tata        2
Mahindra    1
Renault     1
Name: count, dtype: int64
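
To see the distribution as proportions rather than raw counts, `value_counts()` accepts `normalize=True`. A minimal sketch with a hypothetical Series of city names:

```python
import pandas as pd

cities = pd.Series(["Delhi", "Mumbai", "Delhi", "Delhi"])

# normalize=True returns each value's share of the total instead of its count
shares = cities.value_counts(normalize=True)
print(shares)
```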

## Data Cleaning and Handling

### drop()
Drop specified rows or columns.

In [189]:
import pandas as pd
data={
    'A':[1,2,3,4],
    'B':[5,6,7,8],
    'C':[9,10,11,12],
    'D':[13,14,15,16]
}
df=pd.DataFrame(data)
display(df)

Unnamed: 0,A,B,C,D
0,1,5,9,13
1,2,6,10,14
2,3,7,11,15
3,4,8,12,16


In [190]:
#Dropping a column C by Specifying axis=1
drop_col=df.drop('C',axis=1)
print(drop_col)

   A  B   D
0  1  5  13
1  2  6  14
2  3  7  15
3  4  8  16


In [191]:
#Drop row 1 by specifying axis=0
drop_row=df.drop(1,axis=0)
print(drop_row)

   A  B   C   D
0  1  5   9  13
2  3  7  11  15
3  4  8  12  16


In [192]:
#Drop column B in place: inplace=True modifies df itself and returns None
df.drop('B',axis=1,inplace=True)
df

Unnamed: 0,A,C,D
0,1,9,13
1,2,10,14
2,3,11,15
3,4,12,16


In [193]:
#Drop rows 2 and 3; without inplace=True a new DataFrame is returned and df is unchanged
df.drop([2,3])

Unnamed: 0,A,C,D
0,1,9,13
1,2,10,14


In [194]:
df

Unnamed: 0,A,C,D
0,1,9,13
1,2,10,14
2,3,11,15
3,4,12,16
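
As an alternative to the `axis` argument, `drop()` also accepts the `columns=` and `index=` keywords, which some readers find easier to follow. A minimal sketch on a hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

# columns= is equivalent to passing the labels with axis=1
no_b = df.drop(columns=["B"])

# index= is equivalent to passing the labels with axis=0
no_first = df.drop(index=[0])

print(no_b)
print(no_first)
```

Both calls return new DataFrames and leave `df` as it was.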


### fillna()
Fill missing values in the DataFrame.

In [198]:
df_with_null=pd.DataFrame({
    'A':[1,2,None,4,5],
    'B':[2,3,None,5,None]
})
df_with_null

Unnamed: 0,A,B
0,1.0,2.0
1,2.0,3.0
2,,
3,4.0,5.0
4,5.0,


In [202]:
#Filling Missing values with specific value
df_filled=df_with_null.fillna(0)
df_filled

Unnamed: 0,A,B
0,1.0,2.0
1,2.0,3.0
2,0.0,0.0
3,4.0,5.0
4,5.0,0.0
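
A single constant is not always the right replacement. `fillna()` also accepts a different value per column, for example each column's mean or a dictionary of defaults. The sketch below assumes a small hypothetical frame.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, None, 3.0], "B": [None, 4.0, 6.0]})

# df.mean() is a Series keyed by column name, so each column
# is filled with its own mean
filled_mean = df.fillna(df.mean())

# A dictionary gives an explicit fill value per column
filled_dict = df.fillna({"A": 0, "B": -1})

print(filled_mean)
print(filled_dict)
```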


### drop_duplicates()
Remove duplicate rows.

In [204]:
data_duplicated={
    'A':[1,2,1,2,3],
    'B':['x','y','x','y','z']
}
df=pd.DataFrame(data_duplicated)
df

Unnamed: 0,A,B
0,1,x
1,2,y
2,1,x
3,2,y
4,3,z


In [209]:
df.drop_duplicates(inplace=True)
df

Unnamed: 0,A,B
0,1,x
1,2,y
4,3,z
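
By default a row counts as a duplicate only if every column matches. The `subset` and `keep` parameters change that. A minimal sketch, assuming a hypothetical frame in which only column `A` repeats:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2], "B": ["x", "y", "z"]})

# Compare only column A; the first occurrence of each value is kept
first = df.drop_duplicates(subset=["A"])

# keep="last" keeps the last occurrence instead
last = df.drop_duplicates(subset=["A"], keep="last")

print(first)
print(last)
```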


### replace()
Replace values in the DataFrame.

In [212]:
df.replace({'x':'Sami',3:300},inplace=True)
df

Unnamed: 0,A,B
0,1,Sami
1,2,y
4,300,z


## Data Aggregation and Grouping

### groupby()
Group data based on specified columns.

In [218]:
data={
    'Cat':['A','A','B','B','A'],
   'Value':[1,2,3,4,5] 
}
df=pd.DataFrame(data)
df

Unnamed: 0,Cat,Value
0,A,1
1,A,2
2,B,3
3,B,4
4,A,5


In [222]:
grouped=df.groupby('Cat')

for name,group in grouped:
    print(name)
    print(group)

A
  Cat  Value
0   A      1
1   A      2
4   A      5
B
  Cat  Value
2   B      3
3   B      4


### agg()
Apply aggregation functions (sum, mean, etc.) on grouped data.

In [230]:
agg=grouped.agg({'Value':['sum','mean']})
#agg=grouped.agg({'Value':'sum'})
agg

    Value          
      sum      mean
Cat                
A       8  2.666667
B       7  3.500000
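
The dictionary form of `agg()` produces two-level column headers. If flat, self-describing column names are preferred, named aggregation is one option. The sketch below rebuilds the same small table; the output names `total` and `average` are arbitrary choices for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Cat": ["A", "A", "B", "B", "A"],
    "Value": [1, 2, 3, 4, 5],
})

# Named aggregation: new_column=(source_column, aggregation_function)
summary = df.groupby("Cat").agg(
    total=("Value", "sum"),
    average=("Value", "mean"),
)
print(summary)
```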


## Data Transformation

### pivot_table()
Summarize data by spreading the values of one column across new columns and aggregating the matching entries.

In [237]:
import pandas as pd
data={
   'Date':['2022-01-01','2022-01-01','2022-01-02','2022-01-02'],
    'Name':['A','B','A','B'],
    'Price':[10,20,15,25]
}
df=pd.DataFrame(data)
df

Unnamed: 0,Date,Name,Price
0,2022-01-01,A,10
1,2022-01-01,B,20
2,2022-01-02,A,15
3,2022-01-02,B,25


In [238]:
# Creating a pivot table to summarize data
pivot = df.pivot_table(values='Price', index='Date', columns='Name', aggfunc='sum')
pivot

Name         A   B
Date              
2022-01-01  10  20
2022-01-02  15  25


### astype()
Convert a column to a different data type.

In [239]:
data={
    'Num':[1,2,3]
}
df=pd.DataFrame(data)
df

Unnamed: 0,Num
0,1
1,2
2,3


In [240]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Num     3 non-null      int64
dtypes: int64(1)
memory usage: 156.0 bytes


In [246]:
#convert to string
df['Num']=df['Num'].astype(str)

In [247]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Num     3 non-null      object
dtypes: object(1)
memory usage: 156.0+ bytes
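
Going the other way, from strings to numbers, `astype(int)` raises an error as soon as one value cannot be parsed. One option for messy input is `pd.to_numeric` with `errors="coerce"`. A minimal sketch with a hypothetical Series:

```python
import pandas as pd

raw = pd.Series(["1", "2", "three"])

# errors="coerce" turns values that cannot be parsed into NaN
# instead of raising an error
numbers = pd.to_numeric(raw, errors="coerce")
print(numbers)
```

The unparseable entry becomes a missing value, which can then be handled with `fillna()` or `dropna()`.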


## Read/Write CSV/Excel files

In [249]:
import pandas as pd
df=pd.read_csv('healthcare-dataset-stroke-data.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


In [250]:
df.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


In [251]:
df.tail()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
5105,18234,Female,80.0,1,0,Yes,Private,Urban,83.75,,never smoked,0
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.2,40.0,never smoked,0
5107,19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,never smoked,0
5108,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,formerly smoked,0
5109,44679,Female,44.0,0,0,Yes,Govt_job,Urban,85.28,26.2,Unknown,0


### Writing in CSV file

In [253]:
import pandas as pd

# Creating a DataFrame from a Dictionary
data = {
    'Name' : ['Alice', 'Bob', 'Sami','Tahsin','Sahil','Mushfiq'],
    'Age' : [12,23,13,22,22,22],
    'City' : ["NY", "UK", "Dhaka","Khulna",'Dhaka','Dhaka']
}

df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,City
0,Alice,12,NY
1,Bob,23,UK
2,Sami,13,Dhaka
3,Tahsin,22,Khulna
4,Sahil,22,Dhaka
5,Mushfiq,22,Dhaka


In [254]:
#index=False leaves the row index out of the file
df.to_csv('output.csv',index=False)
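
To check what `to_csv` writes without touching the disk, the CSV text can be produced and read back in memory. This is only a sketch; it assumes a two-row hypothetical table and uses `io.StringIO` as a stand-in for a file.

```python
import io
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [12, 23]})

# With no path given, to_csv returns the CSV text as a string
csv_text = df.to_csv(index=False)
print(csv_text)

# Read the text back from an in-memory buffer instead of a file on disk
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```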

### Reading Excel File

In [256]:
#Read excel file
df=pd.read_excel('students.xlsx',sheet_name='Sheet1')

In [257]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Id      3 non-null      int64  
 1   name    3 non-null      object 
 2   dept    3 non-null      object 
 3   cgpa    3 non-null      float64
dtypes: float64(1), int64(1), object(2)
memory usage: 228.0+ bytes


In [258]:
df.head()

Unnamed: 0,Id,name,dept,cgpa
0,111,Jon Doe,ECE,3.9
1,121,Jon Snow,BBA,3.33
2,232,Alice Cuba,ECO,3.55


In [259]:
df.tail()

Unnamed: 0,Id,name,dept,cgpa
0,111,Jon Doe,ECE,3.9
1,121,Jon Snow,BBA,3.33
2,232,Alice Cuba,ECO,3.55


### Writing Excel File

In [260]:
import pandas as pd

# Creating a DataFrame from a Dictionary
data = {
    'Name' : ['Alice', 'Bob', 'Sami','Tahsin','Sahil','Mushfiq'],
    'Age' : [12,23,13,22,22,22],
    'City' : ["NY", "UK", "Dhaka","Khulna",'Dhaka','Dhaka']
}

df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,City
0,Alice,12,NY
1,Bob,23,UK
2,Sami,13,Dhaka
3,Tahsin,22,Khulna
4,Sahil,22,Dhaka
5,Mushfiq,22,Dhaka


In [262]:
df.to_excel('exceloutput.xlsx',sheet_name='Sheet1')