## Introduction

Data scientists commonly spend over half their time cleaning data, so knowing how to clean "messy" data is an extremely important skill.

### Reading CSV Files with Encodings

We'll learn the basics of data cleaning with pandas as we work with `laptops.csv`, a CSV file containing information on roughly 1,300 laptop computers.

Computers, at their lowest levels, can only understand binary (0 and 1), and encodings are systems for representing characters in binary.
If a file's encoding is unknown, one practical approach is to try the most common encodings in turn:
- UTF-8
- Latin-1 (also known as ISO-8859-1)
- Windows-1251
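
The trial-and-error approach above can be sketched as a small loop. This is an illustration rather than part of the original walkthrough: the helper name `read_csv_with_fallback` is hypothetical, and the sketch assumes that a wrong encoding shows up as a `UnicodeDecodeError` when pandas reads the file.

```python
import pandas as pd

# Candidate encodings, in the order listed above.
COMMON_ENCODINGS = ("UTF-8", "Latin-1", "Windows-1251")


def read_csv_with_fallback(path, encodings=COMMON_ENCODINGS):
    """Return (DataFrame, encoding) for the first encoding that decodes the file.

    Hypothetical helper: assumes a wrong encoding raises UnicodeDecodeError.
    """
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding), encoding
        except UnicodeDecodeError:
            continue  # This encoding cannot decode the file; try the next one
    raise ValueError(f"Could not decode {path!r} with any of {encodings}")
```

A call such as `laptops, used = read_csv_with_fallback("data/laptops.csv")` would return the dataframe together with the encoding that worked. One caveat: Latin-1 maps every possible byte value to a character, so it never raises a decoding error and any encoding listed after it is never tried. A successful read therefore does not prove the text was decoded correctly, and it is still worth inspecting a few values by eye.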

In [1]:
import pandas as pd

laptops = pd.read_csv("data/laptops.csv", encoding='Latin-1')
laptops.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 13 columns):
Manufacturer                1303 non-null object
Model Name                  1303 non-null object
Category                    1303 non-null object
Screen Size                 1303 non-null object
Screen                      1303 non-null object
CPU                         1303 non-null object
RAM                         1303 non-null object
 Storage                    1303 non-null object
GPU                         1303 non-null object
Operating System            1303 non-null object
Operating System Version    1133 non-null object
Weight                      1303 non-null object
Price (Euros)               1303 non-null object
dtypes: object(13)
memory usage: 66.2+ KB


### Cleaning Column Names

We can see that every column has the `object` dtype, which indicates that the values are stored as strings rather than numbers. Also, one column, `Operating System Version`, has null values: only 1,133 of its 1,303 entries are non-null.
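
The same two checks can also be run directly instead of being read off the `info()` summary. The snippet below is a minimal sketch on a made-up stand-in dataframe (the name `sample` and its values are assumptions, not the real laptops data).

```python
import pandas as pd

# Tiny stand-in for the laptops data; the values are made up for illustration.
sample = pd.DataFrame({
    "RAM": ["8GB", "16GB", "4GB"],
    "Operating System Version": ["10", None, "10"],
})

print(sample.dtypes)  # dtype of each column

missing = sample.isnull().sum()  # number of null values per column
print(missing)
```

On the real data, `laptops.isnull().sum()` would give the per-column null counts in the same way.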

The column labels mix uppercase and lowercase letters, spaces, and parentheses, which makes them harder to read and work with. One noticeable issue is that the `" Storage"` column name has a leading space. Quirks like this can be hard to spot, so stripping extra whitespace from every column name will save us work in the long run.

We can access the column axis of a dataframe using the `DataFrame.columns` attribute. This returns an `Index` object, an array-like container that holds the label of each column.
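
As a quick illustration of how the attribute behaves, the sketch below reads the labels from a small made-up dataframe (the name `sample` and its contents are assumptions) and then replaces them by assignment.

```python
import pandas as pd

# Stand-in dataframe with one label that has a leading space.
sample = pd.DataFrame({"Manufacturer": ["Apple"], " Storage": ["128GB SSD"]})

labels = sample.columns
print(type(labels))  # the pandas Index class
print(list(labels))  # the labels as plain Python strings

# Assigning a list of the same length replaces all labels at once
sample.columns = ["manufacturer", "storage"]
```

The assigned list must contain exactly one label per column, in column order.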

In [2]:
new_columns = []

for col in laptops.columns:
    column = col.strip()  # Remove leading and trailing whitespace
    new_columns.append(column)
    
laptops.columns = new_columns # Assign clean column names to the original DF

In [3]:
laptops.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 13 columns):
Manufacturer                1303 non-null object
Model Name                  1303 non-null object
Category                    1303 non-null object
Screen Size                 1303 non-null object
Screen                      1303 non-null object
CPU                         1303 non-null object
RAM                         1303 non-null object
Storage                     1303 non-null object
GPU                         1303 non-null object
Operating System            1303 non-null object
Operating System Version    1133 non-null object
Weight                      1303 non-null object
Price (Euros)               1303 non-null object
dtypes: object(13)
memory usage: 66.2+ KB
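
The explicit loop makes each step visible, but the same whitespace cleanup can also be written in one line with the vectorized string methods that pandas exposes on an index through `.str`. The sketch below uses an empty stand-in dataframe (the name `sample` is an assumption) with a few of the original labels.

```python
import pandas as pd

# Empty stand-in dataframe carrying a few of the original labels.
sample = pd.DataFrame(columns=["Manufacturer", " Storage", "Price (Euros)"])

# Strip leading and trailing whitespace from every label at once
sample.columns = sample.columns.str.strip()
print(list(sample.columns))
```

Only the leading space in `" Storage"` is removed; the space inside `"Price (Euros)"` is untouched because `strip()` affects only the ends of a string.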


The column labels still have a variety of upper and lowercase letters, as well as parentheses, which will make them harder to work with and read. Let's finish cleaning our column labels by:
- Replacing spaces with underscores.
- Removing special characters.
- Making all labels lowercase.
- Shortening any long column names.

In [4]:
def clean_col(col_name):
    col = col_name.strip()  # Remove leading and trailing whitespace
    col = col.replace("Operating System", "os")  # Shorten the long label
    col = col.replace(" ", "_")  # Replace spaces with underscores
    # Remove parentheses
    col = col.replace("(", "")
    col = col.replace(")", "")
    col = col.lower()  # Make the label lowercase
    return col

new_columns = []

for column in laptops.columns:
    column_name = clean_col(column)
    new_columns.append(column_name)
    
laptops.columns = new_columns

In [5]:
laptops.columns

Index(['manufacturer', 'model_name', 'category', 'screen_size', 'screen',
       'cpu', 'ram', 'storage', 'gpu', 'os', 'os_version', 'weight',
       'price_euros'],
      dtype='object')
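
As an alternative to building the list of new labels by hand, a cleaning function can be passed straight to `DataFrame.rename`, which applies it to every column label. The sketch below restates `clean_col` as a single method chain and runs it on an empty stand-in dataframe; the names `raw_labels`, `sample`, and `cleaned` are assumptions made for the example.

```python
import pandas as pd


def clean_col(col_name):
    """Same cleaning steps as the function above, written as one chain."""
    return (
        col_name.strip()
        .replace("Operating System", "os")
        .replace(" ", "_")
        .replace("(", "")
        .replace(")", "")
        .lower()
    )


raw_labels = ["Manufacturer", " Storage", "Operating System Version", "Price (Euros)"]
sample = pd.DataFrame(columns=raw_labels)

# rename() calls clean_col on every label and returns a new dataframe
cleaned = sample.rename(columns=clean_col)
print(list(cleaned.columns))
```

Unlike assigning to `laptops.columns`, `rename` leaves the original dataframe unchanged, so the result has to be assigned back (for example `laptops = laptops.rename(columns=clean_col)`) for the new labels to take effect.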

### Converting String Columns to Numeric

