## What is a CSV?
CSV stands for “Comma-Separated Values”. It is one of the simplest ways to store
tabular data as plain text: each line is a row, and the values in a row are separated
by commas. Knowing how to work with CSV files matters because much of the data that
data scientists handle day to day arrives in this format.
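
To make "tabular data as plain text" concrete, here is a minimal sketch that parses a small CSV string with pandas. The sample rows and the names `csv_text` and `df_example` are made up for illustration and do not come from the notebook.

```python
import io
import pandas as pd

# A CSV file is just text: the first line usually holds the column names,
# and each following line is one row with its values separated by commas.
csv_text = "Name,Age\nAlice,25\nBob,30\n"

# io.StringIO lets read_csv treat the string like an open file
df_example = pd.read_csv(io.StringIO(csv_text))
print(df_example.shape)          # (2, 2)
print(list(df_example.columns))  # ['Name', 'Age']
```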

## How to Write CSV Files in Python Using Pandas
### to_csv()

In [2]:
import pandas as pd
# Create a Pandas DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]
}
df = pd.DataFrame(data)

csv_file_path = 'output.csv'
# Write the DataFrame to a CSV file
df.to_csv(csv_file_path, index=False) 

print(f'Data has been written to {csv_file_path}')

Data has been written to output.csv
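
The `index=False` argument used above keeps the row labels out of the file. As a rough sketch of the difference, `to_csv()` returns the CSV text as a string when no path is given, which makes the two variants easy to compare. The frame `df_small` is a made-up example.

```python
import pandas as pd

df_small = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# With no path argument, to_csv() returns the CSV text instead of writing a file
with_index = df_small.to_csv()
without_index = df_small.to_csv(index=False)

# The default output has an extra, unnamed first column holding the row labels
print(with_index.splitlines()[0])     # ,Name,Age
print(without_index.splitlines()[0])  # Name,Age
```

Leaving the index in usually produces an extra `Unnamed: 0` column when the file is read back, which is why `index=False` is a common choice for a default `RangeIndex`.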


## How to Read CSV Files in Python Using Pandas

In [3]:
data = {
'OrderID': [101, 102, 103, 104, 105],
'Product': ['Widget', 'Gadget', 'Widget', 'Doodad', 'Widget'],
'Quantity': [10, 5, 0, 7, 15],
'Status': ['Shipped', 'Canceled', 'Canceled', 'Shipped', 'Shipped']
}
df = pd.DataFrame(data)

csv_file_path = 'output1.csv'
# Write the DataFrame to a CSV file
df.to_csv(csv_file_path, index=False) 

print(f'Data has been written to {csv_file_path}')

Data has been written to output1.csv


In [4]:
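# Read the whole file
data=pd.read_csv("output1.csv")
print(data)
print()
# nrows=3 reads only the first three data rows
data1=pd.read_csv("output1.csv",nrows=3)
print(data1)
print()
# usecols selects columns by name ...
data2=pd.read_csv("output1.csv",nrows=3,usecols=["Product","Status"])
print(data2)
print()
# ... or by zero-based column position
data3=pd.read_csv("output1.csv",nrows=3,usecols=[0,2,3])
print(data3)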


   OrderID Product  Quantity    Status
0      101  Widget        10   Shipped
1      102  Gadget         5  Canceled
2      103  Widget         0  Canceled
3      104  Doodad         7   Shipped
4      105  Widget        15   Shipped

   OrderID Product  Quantity    Status
0      101  Widget        10   Shipped
1      102  Gadget         5  Canceled
2      103  Widget         0  Canceled

  Product    Status
0  Widget   Shipped
1  Gadget  Canceled
2  Widget  Canceled

   OrderID  Quantity    Status
0      101        10   Shipped
1      102         5  Canceled
2      103         0  Canceled


In [5]:
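data=pd.read_csv("output1.csv")
print(data)
print()
# skiprows=[0,3] skips file lines 0 (the header) and 3 (OrderID 103),
# so the row for OrderID 101 is used as the header
data4=pd.read_csv("output1.csv",nrows=5,skiprows=[0,3])
print(data4)
print()
# index_col uses the OrderID column as the row index
data4=pd.read_csv("output1.csv",index_col="OrderID")
print(data4)
print()
# header=2 uses file line 2 (the row for OrderID 102) as the header and drops the lines before it
data5=pd.read_csv("output1.csv",header=2)
print(data5)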

   OrderID Product  Quantity    Status
0      101  Widget        10   Shipped
1      102  Gadget         5  Canceled
2      103  Widget         0  Canceled
3      104  Doodad         7   Shipped
4      105  Widget        15   Shipped

   101  Widget  10   Shipped
0  102  Gadget   5  Canceled
1  104  Doodad   7   Shipped
2  105  Widget  15   Shipped

        Product  Quantity    Status
OrderID                            
101      Widget        10   Shipped
102      Gadget         5  Canceled
103      Widget         0  Canceled
104      Doodad         7   Shipped
105      Widget        15   Shipped

   102  Gadget   5  Canceled
0  103  Widget   0  Canceled
1  104  Doodad   7   Shipped
2  105  Widget  15   Shipped


In [6]:
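data=pd.read_csv("output1.csv")
print(data)
print()
# names supplies new column labels; because header=0 is not passed, the original
# header row is read as data, and the unused fifth name becomes an all-NaN column
data6=pd.read_csv("output1.csv",names=["col1","col2","col3","col4","col5"])
print(data6)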

   OrderID Product  Quantity    Status
0      101  Widget        10   Shipped
1      102  Gadget         5  Canceled
2      103  Widget         0  Canceled
3      104  Doodad         7   Shipped
4      105  Widget        15   Shipped

      col1     col2      col3      col4  col5
0  OrderID  Product  Quantity    Status   NaN
1      101   Widget        10   Shipped   NaN
2      102   Gadget         5  Canceled   NaN
3      103   Widget         0  Canceled   NaN
4      104   Doodad         7   Shipped   NaN
5      105   Widget        15   Shipped   NaN
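
In the output above, the original header row shows up as data because `names` was passed on its own. One way to rename the columns while discarding the header that is already in the file is to pass `header=0` together with `names`. The sketch below assumes an in-memory copy of the first rows of `output1.csv` so that it runs on its own; the new column names are hypothetical.

```python
import io
import pandas as pd

# In-memory stand-in for the first rows of output1.csv (assumption for this sketch)
csv_text = (
    "OrderID,Product,Quantity,Status\n"
    "101,Widget,10,Shipped\n"
    "102,Gadget,5,Canceled\n"
)

# header=0 tells pandas that line 0 is a header; names then replaces it
renamed = pd.read_csv(
    io.StringIO(csv_text),
    header=0,
    names=["order_id", "product", "quantity", "status"],
)
print(renamed)
```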


In [7]:
print("Index")
data7=pd.read_csv("output1.csv")
print(data7.index)
print()
print("columns")
data8=pd.read_csv("output1.csv")
print(data8.columns)
print()
print("describe() function")
data8=pd.read_csv("output1.csv")
print(data8.describe())
print()
print("head function")
data9=pd.read_csv("output1.csv")
print(data9.head())
print()
print("tail function ")
data10=pd.read_csv("output1.csv")
print(data10.tail())
print()
print("tail function ")
data11=pd.read_csv("output1.csv")
print(data11.tail(3))

Index
RangeIndex(start=0, stop=5, step=1)

columns
Index(['OrderID', 'Product', 'Quantity', 'Status'], dtype='object')

describe() function
          OrderID  Quantity
count    5.000000   5.00000
mean   103.000000   7.40000
std      1.581139   5.59464
min    101.000000   0.00000
25%    102.000000   5.00000
50%    103.000000   7.00000
75%    104.000000  10.00000
max    105.000000  15.00000

head function
   OrderID Product  Quantity    Status
0      101  Widget        10   Shipped
1      102  Gadget         5  Canceled
2      103  Widget         0  Canceled
3      104  Doodad         7   Shipped
4      105  Widget        15   Shipped

tail function 
   OrderID Product  Quantity    Status
0      101  Widget        10   Shipped
1      102  Gadget         5  Canceled
2      103  Widget         0  Canceled
3      104  Doodad         7   Shipped
4      105  Widget        15   Shipped

tail function 
   OrderID Product  Quantity    Status
2      103  Widget         0  Canceled
3      104  Doo

In [8]:
data12=pd.read_csv("output1.csv")
print(data12[1:3])
print()
print("info function ")
data13=pd.read_csv("output1.csv")
print(data13.info())

   OrderID Product  Quantity    Status
1      102  Gadget         5  Canceled
2      103  Widget         0  Canceled

info function 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   OrderID   5 non-null      int64 
 1   Product   5 non-null      object
 2   Quantity  5 non-null      int64 
 3   Status    5 non-null      object
dtypes: int64(2), object(2)
memory usage: 292.0+ bytes
None


In [9]:
print("info function ")
data13=pd.read_csv("output1.csv")
print(data13.info())
print()
print("info function ")
data14=pd.read_csv("output1.csv")
print(data14.info(verbose=False))
print()


info function 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   OrderID   5 non-null      int64 
 1   Product   5 non-null      object
 2   Quantity  5 non-null      int64 
 3   Status    5 non-null      object
dtypes: int64(2), object(2)
memory usage: 292.0+ bytes
None

info function 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Columns: 4 entries, OrderID to Status
dtypes: int64(2), object(2)
memory usage: 292.0+ bytes
None



In [10]:
print("info function ")
data15=pd.read_csv("output1.csv")
print(data15.info(max_cols=1))
print()
print("info function ")
data16=pd.read_csv("output1.csv")
print(data16.info(memory_usage=True))
print()

info function 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Columns: 4 entries, OrderID to Status
dtypes: int64(2), object(2)
memory usage: 292.0+ bytes
None

info function 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   OrderID   5 non-null      int64 
 1   Product   5 non-null      object
 2   Quantity  5 non-null      int64 
 3   Status    5 non-null      object
dtypes: int64(2), object(2)
memory usage: 292.0+ bytes
None



In [11]:
print("count function ")
data16=pd.read_csv("output1.csv")
print(data16.count())
print()

count function 
OrderID     5
Product     5
Quantity    5
Status      5
dtype: int64
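
`count()` reports the number of non-null values per column, so it only becomes informative once a column has missing data. The sketch below uses a made-up frame, `df_missing`, to show `count()` next to `isna().sum()`, which counts the missing values instead.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in two columns
df_missing = pd.DataFrame({
    "OrderID": [101, 102, 103],
    "Quantity": [10, np.nan, 0],
    "Status": ["Shipped", None, None],
})

non_null = df_missing.count()     # non-null values per column
missing = df_missing.isna().sum() # missing values per column
print(non_null)
print(missing)
```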



In [12]:
data16=pd.read_csv("output1.csv")
print(data16.value_counts())
print()
# Inspect the non-null counts before removing NaN values
print("Before Removing nan Values")
print(data16.info())
# Remove rows that contain NaN values (on data16, the frame being inspected)
data16.dropna(inplace=True)
print("After Removing nan Values")
print(data16.info())

print()
print()
# Flag rows that duplicate an earlier row
duplicate_rows = data16.duplicated()
# Print the number of duplicate rows
print("Number of duplicate rows:", duplicate_rows.sum())

OrderID  Product  Quantity  Status  
101      Widget   10        Shipped     1
102      Gadget   5         Canceled    1
103      Widget   0         Canceled    1
104      Doodad   7         Shipped     1
105      Widget   15        Shipped     1
Name: count, dtype: int64
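
Called on a whole DataFrame, `value_counts()` counts unique rows, which is why every count above is 1. Calling it on a single column is often more useful. This is a small sketch that rebuilds two columns of the order data in memory so it can run on its own.

```python
import pandas as pd

# Same Product and Status values as in output1.csv
orders = pd.DataFrame({
    "Product": ["Widget", "Gadget", "Widget", "Doodad", "Widget"],
    "Status": ["Shipped", "Canceled", "Canceled", "Shipped", "Shipped"],
})

# Counting values in one column shows how often each category occurs
product_counts = orders["Product"].value_counts()
status_counts = orders["Status"].value_counts()
print(product_counts)
print(status_counts)
```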

Before Removing nan Values
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   OrderID   5 non-null      int64 
 1   Product   5 non-null      object
 2   Quantity  5 non-null      int64 
 3   Status    5 non-null      object
dtypes: int64(2), object(2)
memory usage: 292.0+ bytes
None
After Removing nan Values
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   OrderID   5 non-null      int64 
 1   Product   5 non-null      object
 2   Quantity  5 non-null      in

## 6. Save the cleaned dataset

In [13]:

import pandas as pd
import numpy as np

# Create sample data with null and duplicate values
data = {
    'Student ID': [10001, 10002, 10001, 10002, 10004, 10005],
    'Name': ['Alice Smith', 'Bob Jones', 'Alice Smith', 'Bob Jones', 'Diana Garcia', 'Emily Wilson'],
    'Age': [18, 19, 18, 20, 21, 18],
    'Grade': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Major': ['Computer Science', 'Mathematics', 'Computer Science', 'Mathematics', 'Business', 'Biology'],
    'GPA': [3.8, 3.5, 3.8, 3.5, 3.9, 3.7],
    'City': ['New York', 'Los Angeles', "New York", None, 'Miami', 'Houston']  # None introduces a null value
}
df = pd.DataFrame(data)

csv_file_path = 'duplicates.csv'
df.to_csv(csv_file_path, index=False)

print(f'Data has been written to {csv_file_path}')

Data has been written to duplicates.csv


In [14]:
# Import pandas
import pandas as pd

# Read the data from the CSV file
data = pd.read_csv('duplicates.csv')
# Check the number of rows and columns before cleaning
print("Before removing checking no of row and col")
print(data.shape)
print(data)
# Remove fully duplicated rows
data.drop_duplicates(inplace=True)
# Check the number of rows and columns after removing duplicates
print("After removing checking no of row and col")
print(data.shape)
# Remove rows that contain NaN values
data.dropna(inplace=True)
print("After removing Nan values no of row and col")
print(data.shape)
# Save the cleaned data to a new CSV file
data.to_csv('cleaned_data.csv', index=False)
print("Data Saved ")
print(data)

Before removing checking no of row and col
(6, 7)
   Student ID          Name  Age Grade             Major  GPA         City
0       10001   Alice Smith   18     A  Computer Science  3.8     New York
1       10002     Bob Jones   19     B       Mathematics  3.5  Los Angeles
2       10001   Alice Smith   18     A  Computer Science  3.8     New York
3       10002     Bob Jones   20     B       Mathematics  3.5          NaN
4       10004  Diana Garcia   21     A          Business  3.9        Miami
5       10005  Emily Wilson   18     B           Biology  3.7      Houston
After removing checking no of row and col
(5, 7)
After removing Nan values no of row and col
(4, 7)
Data Saved 
   Student ID          Name  Age Grade             Major  GPA         City
0       10001   Alice Smith   18     A  Computer Science  3.8     New York
1       10002     Bob Jones   19     B       Mathematics  3.5  Los Angeles
4       10004  Diana Garcia   21     A          Business  3.9        Miami
5       10005

### drop_duplicates() Function

In [15]:

import pandas as pd
import numpy as np

# Create sample data with null and duplicate values
data = {
    'Student ID': [10001, 10002, 10001, 10002, 10004, 10005],
    'Name': ['Alice Smith', 'Bob Jones', 'Alice Smith', 'Bob Jones', 'Diana Garcia', 'Emily Wilson'],
    'Age': [18, 19, 18, 20, 21, 18],
    'Grade': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Major': ['Computer Science', 'Mathematics', 'Computer Science', 'Mathematics', 'Business', 'Biology'],
    'GPA': [3.8, 3.5, 3.8, 3.5, 3.9, 3.7],
    'City': ['New York', 'Los Angeles', "New York", None, 'Miami', 'Houston']  # None introduces a null value
}

df = pd.DataFrame(data)

csv_file_path = 'duplicates.csv'
df.to_csv(csv_file_path, index=False)

# Build the working DataFrame from the dictionary (the CSV written above is not read back here)
data = pd.DataFrame(data)
print(f'Data has been written to {csv_file_path}')
# Flag rows that duplicate an earlier row
duplicate_rows = data.duplicated()
# Print the number of duplicate rows and show them
print("Before Removing Number of duplicate rows:", duplicate_rows.sum())
print(data[duplicate_rows])
# Remove the duplicate rows in place
data.drop_duplicates(inplace=True)
# Check again for duplicates
duplicate_rows = data.duplicated()
# Print the number of duplicate rows that remain
print("After Removing Number of duplicate rows:", duplicate_rows.sum())
print(data[duplicate_rows])
# Save the de-duplicated data
data.to_csv('after_drop.csv',index=False)
print(data)

Data has been written to duplicates.csv
Before Removing Number of duplicate rows: 1
   Student ID         Name  Age Grade             Major  GPA      City
2       10001  Alice Smith   18     A  Computer Science  3.8  New York
After Removing Number of duplicate rows: 0
Empty DataFrame
Columns: [Student ID, Name, Age, Grade, Major, GPA, City]
Index: []
   Student ID          Name  Age Grade             Major  GPA         City
0       10001   Alice Smith   18     A  Computer Science  3.8     New York
1       10002     Bob Jones   19     B       Mathematics  3.5  Los Angeles
3       10002     Bob Jones   20     B       Mathematics  3.5         None
4       10004  Diana Garcia   21     A          Business  3.9        Miami
5       10005  Emily Wilson   18     B           Biology  3.7      Houston
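
The output above still contains two rows for Student ID 10002, because `drop_duplicates()` compares every column by default and those rows differ in `Age` and `City`. If a single column should decide what counts as a duplicate, the `subset` and `keep` parameters can be used. The sketch below uses a trimmed-down, made-up version of the student data; whether keeping the first row per ID is the right rule depends on the dataset.

```python
import pandas as pd

# Trimmed-down version of the student data (illustrative only)
students = pd.DataFrame({
    "Student ID": [10001, 10002, 10001, 10002],
    "Name": ["Alice Smith", "Bob Jones", "Alice Smith", "Bob Jones"],
    "Age": [18, 19, 18, 20],
})

# Default: a row is a duplicate only if every column matches an earlier row
full_row = students.drop_duplicates()

# subset: compare only Student ID, keeping the first row seen for each ID
by_id = students.drop_duplicates(subset=["Student ID"], keep="first")

print(len(full_row))  # 3
print(len(by_id))     # 2
```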


## Inspecting a DataFrame after dropping duplicates

In [16]:
import pandas as pd
import numpy as np

# Create sample student data with potential duplicates and missing values
data = {
    'Student ID': [1001, 1002, 1003, 1001, 1004],
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David'],
    'Age': [18, 19, 20, 18, 21],
    'Grade': ['A', 'B', 'C', 'A', 'A'],
    'Major': ['Computer Science', 'Math', 'Physics', 'Computer Science', 'History'],
    'GPA': [3.8, 3.5, 3.2, np.nan, 3.9],  # Introduce a missing value
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Houston']
}

# Create DataFrame
df = pd.DataFrame(data)

# Drop duplicate rows based on all columns
# (the second Alice row is kept because its GPA is NaN, so it is not an exact duplicate)
df = df.drop_duplicates()
print(df)

# Display shape of the DataFrame
print("Shape of the DataFrame:", df.shape)

# Display information about the DataFrame
print("\nInformation about the DataFrame:")
print(df.info())

# Display descriptive statistics
print("\nDescriptive statistics:")
print(df.describe())

   Student ID     Name  Age Grade             Major  GPA         City
0        1001    Alice   18     A  Computer Science  3.8     New York
1        1002      Bob   19     B              Math  3.5  Los Angeles
2        1003  Charlie   20     C           Physics  3.2      Chicago
3        1001    Alice   18     A  Computer Science  NaN     New York
4        1004    David   21     A           History  3.9      Houston
Shape of the DataFrame: (5, 7)

Information about the DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Student ID  5 non-null      int64  
 1   Name        5 non-null      object 
 2   Age         5 non-null      int64  
 3   Grade       5 non-null      object 
 4   Major       5 non-null      object 
 5   GPA         4 non-null      float64
 6   City        5 non-null      object 
dtypes: float64(1), int64(2), object(4)
memory