DATA CLEANING BASICS

1. Detecting Missing Values in a CSV File.

In [9]:
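import pandas as pd

# Load the retail data; empty cells are read as NaN.
df = pd.read_csv("retail_store.csv")

# Count the missing values in each column.
missing_values = df.isnull().sum()

print("Missing values per column:")
print(missing_values)

print(df)

Missing values per column:
Product     1
Price       1
Category    1
dtype: int64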
  Product  Price     Category
0  Laptop  800.0  Electronics
1   Shirt   25.0          NaN
2  Laptop  800.0  Electronics
3     NaN   50.0     Clothing
4   Phone    NaN  Electronics
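
Column counts are only one view of missing data. As a minimal sketch, assuming an in-memory frame that mirrors the rows printed above (the names `sample`, `missing_share`, and `incomplete_rows` are illustrative and not part of the original notebook), the share of missing values per column and the rows that contain any gap could be obtained like this:

```python
import numpy as np
import pandas as pd

# Illustrative frame mirroring the printed rows (an assumption, not the real CSV).
sample = pd.DataFrame({
    "Product": ["Laptop", "Shirt", "Laptop", np.nan, "Phone"],
    "Price": [800.0, 25.0, 800.0, 50.0, np.nan],
    "Category": ["Electronics", np.nan, "Electronics", "Clothing", "Electronics"],
})

# Fraction of missing values in each column (mean of the boolean mask).
missing_share = sample.isna().mean()

# Rows that have at least one missing value.
incomplete_rows = sample[sample.isna().any(axis=1)]

print(missing_share)
print(incomplete_rows)
```

Looking at the incomplete rows directly helps decide whether filling or dropping is the better choice for each column.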


2. Filling the Missing Values in a CSV File.

In [10]:
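# Fill missing prices with the column mean; assign the result back instead of using inplace=True.
df["Price"] = df["Price"].fillna(df["Price"].mean())

# Fill missing product names with a placeholder.
df["Product"] = df["Product"].fillna("Unknown")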

print(df)

   Product   Price     Category
0   Laptop  800.00  Electronics
1    Shirt   25.00          NaN
2   Laptop  800.00  Electronics
3  Unknown   50.00     Clothing
4    Phone  418.75  Electronics


Note: the chained form df[col].fillna(value, inplace=True) raises the following FutureWarning in recent pandas 2.x releases. Assigning the result back to the column avoids it.

FutureWarning: The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.
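
The warning's first suggestion, calling `fillna` on the whole frame with a column-to-value mapping, can be sketched as follows. The frame `sample` is an illustrative stand-in for the CSV data, and `Category` is deliberately left unfilled, as in the cell above.

```python
import numpy as np
import pandas as pd

# Illustrative frame mirroring the printed rows (an assumption, not the real CSV).
sample = pd.DataFrame({
    "Product": ["Laptop", "Shirt", "Laptop", np.nan, "Phone"],
    "Price": [800.0, 25.0, 800.0, 50.0, np.nan],
    "Category": ["Electronics", np.nan, "Electronics", "Clothing", "Electronics"],
})

# One call fills each listed column with its own replacement value;
# columns not named in the mapping are left untouched.
filled = sample.fillna({"Price": sample["Price"].mean(), "Product": "Unknown"})

print(filled)
```

Returning a new frame rather than modifying in place keeps the original data available for comparison.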


3. Dropping Rows with Missing Values.

In [11]:
df=df.dropna()

print(df)

   Product   Price     Category
0   Laptop  800.00  Electronics
2   Laptop  800.00  Electronics
3  Unknown   50.00     Clothing
4    Phone  418.75  Electronics
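
Called without arguments, `dropna()` removes a row if any column is missing. When only certain columns matter, the `subset` parameter narrows the check, and `how="all"` removes only rows that are empty throughout. A small sketch on hypothetical data (the frame `sample` is made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 2 lacks a price, row 3 is entirely empty.
sample = pd.DataFrame({
    "Product": ["Laptop", "Shirt", "Phone", np.nan],
    "Price": [800.0, 25.0, np.nan, np.nan],
    "Category": ["Electronics", np.nan, "Electronics", np.nan],
})

# Drop a row only when its Price is missing.
priced = sample.dropna(subset=["Price"])

# Drop a row only when every column is missing.
not_empty = sample.dropna(how="all")

print(priced)
print(not_empty)
```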


4. Make a Function that Drops the Rows with Missing Values.

In [12]:
import pandas as pd

def Clean_data(file_name):
    """Read a CSV file and return it without rows that contain missing values."""
    df = pd.read_csv(file_name)
    df = df.dropna()
    return df

Clean_data("retail_store.csv")

  Product  Price     Category
0  Laptop  800.0  Electronics
2  Laptop  800.0  Electronics
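
Because `pd.read_csv` also accepts file-like objects, a function of this kind can be tried without a file on disk. The sketch below uses a hypothetical variant named `clean_frame`, and the CSV text is an assumption reconstructed from the rows printed earlier, not the actual contents of retail_store.csv.

```python
import io

import pandas as pd


def clean_frame(source):
    """Hypothetical variant: read CSV data from a path or file-like object
    and drop every row that contains a missing value."""
    return pd.read_csv(source).dropna()


# Assumed CSV content, reconstructed from the rows printed earlier.
csv_text = (
    "Product,Price,Category\n"
    "Laptop,800,Electronics\n"
    "Shirt,25,\n"
    "Laptop,800,Electronics\n"
    ",50,Clothing\n"
    "Phone,,Electronics\n"
)

cleaned = clean_frame(io.StringIO(csv_text))
print(cleaned)
```

Under these assumptions only the two complete Laptop rows remain, which agrees with the output shown above.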


5. Make a Function that Cleans the Text Data.

In [13]:
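import re

def Clean_text(file_name):
    """Read a text file and return it lowercased with punctuation removed."""
    with open(file_name, "r", encoding="utf-8", errors="ignore") as file:
        text = file.read()
    text = text.strip()                    # trim leading and trailing whitespace
    text = re.sub(r"[^\w\s]", "", text)    # keep only word characters and whitespace
    text = text.lower()
    return text

Clean_text("customer_reviews.txt")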

'excellent product the quality is topnotch and the delivery was super fast not happy with the battery life it drains too quickly even on standby great value for money ive been using it for a week and it works perfectly the customer service was not helpful i had an issue and they took too long to respond loved the design and performance feels premium and wellbuilt the product arrived with a minor scratch not a big deal but still disappointing very satisfied works as expected and even better than described waste of money stopped working after just two days highly recommended i bought it for my brother and he loves it delivery was delayed by five days the product is good but the experience wasnt great sound quality is amazing best ive experienced in this price range overheats after prolonged usage not ideal for heavy tasks packaging was impressive and the setup process was hasslefree the screen resolution is not as sharp as expected looks a bit outdated best purchase ive made this year wo
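
The result above is one long string, so the boundaries between individual reviews are hard to see. If each review sits on its own line, the same cleaning steps can be applied line by line. The following is a sketch under that assumption; the function name `clean_lines` and the sample text `reviews` are illustrative.

```python
import re


def clean_lines(raw_text):
    """Lowercase each non-empty line and strip punctuation,
    returning one cleaned string per line."""
    cleaned = []
    for line in raw_text.splitlines():
        line = re.sub(r"[^\w\s]", "", line.strip()).lower()
        if line:  # skip lines that are empty after cleaning
            cleaned.append(line)
    return cleaned


# Illustrative input, not the real customer_reviews.txt.
reviews = (
    "Excellent product! The quality is top-notch.\n"
    "\n"
    "Not happy with the battery life...\n"
)

result = clean_lines(reviews)
print(result)
# ['excellent product the quality is topnotch', 'not happy with the battery life']
```

Note that the pattern also deletes hyphens and apostrophes, which is why "top-notch" becomes "topnotch".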

6. Removing Duplicate Entries from a DataFrame.

In [14]:
data = {
    "ID": [101, 102, 103, 101, 104, 102, 105],
    "Name": ["Alice", "Bob", "Charlie", "Alice", "David", "Bob", "Eve"],
    "Age": [25, 30, 35, 25, 40, 30, 45],
}

df=pd.DataFrame(data)
unique_data=df.drop_duplicates()
print(unique_data)

    ID     Name  Age
0  101    Alice   25
1  102      Bob   30
2  103  Charlie   35
4  104    David   40
6  105      Eve   45
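
By default `drop_duplicates()` compares entire rows and keeps the first occurrence. When a single column identifies a record, the `subset` and `keep` parameters control which rows count as duplicates and which copy survives. A brief sketch on made-up data (the frame `records` is hypothetical):

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 share an ID but differ in Name and Age.
records = pd.DataFrame({
    "ID": [101, 102, 101, 103],
    "Name": ["Alice", "Bob", "Alice B.", "Charlie"],
    "Age": [25, 30, 26, 35],
})

# Full-row comparison finds no duplicates here.
full = records.drop_duplicates()

# Comparing only ID treats rows 0 and 2 as duplicates;
# keep="last" retains the later of the two.
by_id = records.drop_duplicates(subset="ID", keep="last")

print(full)
print(by_id)
```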


7. Extracting Keywords from Text.

In [15]:
strings = [
    "Apple - Fresh and Organic",
    "Samsung Galaxy S21 - 128GB",
    "Python Programming - Beginner to Advanced",
    "Tesla Model S - Electric Car",
    "Nike Air Max - Running Shoes",
    "The Great Gatsby - Classic Novel",
    "Sony WH-1000XM4 - Noise Cancelling Headphones",
    "Coca-Cola - Refreshing Drink",
    "Amazon Echo - Smart Speaker",
    "Dell XPS 13 - Laptop with Intel i7"
]

# Split on the " - " separator (with spaces) so hyphens inside names are preserved.
strings_list = [item.split(" - ", 1)[0].strip() for item in strings]

print(strings_list)

['Apple', 'Samsung Galaxy S21', 'Python Programming', 'Tesla Model S', 'Nike Air Max', 'The Great Gatsby', 'Sony WH-1000XM4', 'Coca-Cola', 'Amazon Echo', 'Dell XPS 13']
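
When the description after the separator is also needed, `str.partition` splits on the first occurrence of a separator and returns both sides. A short sketch, assuming the separator is always a hyphen surrounded by spaces (the list `items` and the name `pairs` are illustrative):

```python
# Illustrative input; the last entry has no separator at all.
items = [
    "Sony WH-1000XM4 - Noise Cancelling Headphones",
    "Coca-Cola - Refreshing Drink",
    "Apple",
]

pairs = []
for item in items:
    # partition returns (before, separator, after); if the separator
    # is absent, "after" is an empty string.
    name, _, description = item.partition(" - ")
    pairs.append((name.strip(), description.strip()))

print(pairs)
```

Entries without a separator keep their full text as the name and receive an empty description, so no special-casing is required.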
