# Activity 4.1 - Cleaning Walmart Data the OpenRefine Way

In this activity, you will practice what you learned in Lecture 4.5 by cleaning up a data set containing information on various Walmart locations.

In [1]:
import pandas as pd
from dfply import *

#### Initial Tasks

1. Try to read in the `./data/Walmart_United_States_&_Canada.csv` file and verify that you get an encoding error.  This means that the [character encoding](https://en.wikipedia.org/wiki/Character_encoding) isn't the default of `utf-8`.  The easiest way to fix this is to open and save the file in Visual Studio Code.

2. Read in the data to verify that the encoding is fixed, but that there are two more problems.  What are they?
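As an aside to the re-save approach in step 1, `pd.read_csv` also accepts an `encoding` argument. The sketch below is only an illustration on a tiny in-memory sample: the sample text and the choice of `latin-1` are assumptions, since the actual encoding of the Walmart file is not stated here.

```python
import io
import pandas as pd

# Hypothetical one-line sample, encoded as Latin-1 rather than UTF-8.
raw = 'Walmart Supercentre,"Montréal, QC"\n'.encode("latin-1")

# Reading with the default UTF-8 decoding fails on the accented character.
try:
    pd.read_csv(io.BytesIO(raw), header=None)
    failed_with_default = False
except UnicodeDecodeError:
    failed_with_default = True

# Naming the encoding explicitly lets the same bytes parse cleanly.
decoded = pd.read_csv(io.BytesIO(raw), header=None, encoding="latin-1")
print(failed_with_default, decoded.iloc[0, 1])
```

If the encoding is guessed wrong, the read may still succeed but show garbled characters, so it is worth checking a few rows with accented text.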

In [19]:
#Your code here

In [13]:
# Your code here
walmart = pd.read_csv("./data/Walmart_United_States_&_Canada.csv")
walmart.head()

Unnamed: 0,-114.005671,51.262567,"Walmart Supercentre; #1050,","2881 Main St SW,Airdrie ,AB T4B 3G5,(403) 945-1295"
0,-111.900542,50.577939,"Walmart Supercentre; #3658,","917 3rd St W,Brooks ,AB T1R 1L5,(403) 793-2111"
1,-114.039133,51.107253,"Walmart Supercentre; #3013,","1110 57th Ave NE,Calgary ,(NOP),AB T2E 9B7,(40..."
2,-114.138488,51.040871,"Walmart Supercentre; #3009,Gas,","1212 37 St SW,Calgary ,(NOP),AB T3C 1S3,(403) ..."
3,-114.028603,50.930551,"Walmart; #1144,","1221 Canyon Meadows Dr SE,Calgary ,AB T2J 6G2,..."
4,-113.91159,51.04009,"Walmart Supercentre; #1136,","255 E Hills Blvd SE,Calgary ,AB T2A 4X7,(403) ..."


<font color="blue"> No headers, Name and store number together, Additional optional "gas" field, More commas as address separators, Address and phone concatenated </font>

3. Take another look at the file in VS Code and determine solutions to these issues, then read in the data correctly by passing `pd.read_csv` the appropriate arguments for this data. **Note.** Leave the `"` in place for now, as they serve an important role here!

In [4]:
help(pd.read_csv) # This might help!

Help on function read_csv in module pandas.io.parsers.readers:

read_csv(filepath_or_buffer: 'FilePath | ReadCsvBuffer[bytes] | ReadCsvBuffer[str]', sep=<no_default>, delimiter=None, header='infer', names=<no_default>, index_col=None, usecols=None, squeeze=None, prefix=<no_default>, mangle_dupe_cols=True, dtype: 'DtypeArg | None' = None, engine: 'CSVEngine | None' = None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=None, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression: 'CompressionOptions' = 'infer', thousands=None, decimal: 'str' = '.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, encoding_errors: 'str | None' = 'strict', dialect=None, error_bad_li

<font color="blue"> The `names` parameter lets us supply the column headers. </font>

In [14]:
# Your code here
walmart = pd.read_csv("./data/Walmart_United_States_&_Canada.csv", names=["Lat","Long","Description","Address_Phone"])
walmart.head()

Unnamed: 0,Lat,Long,Description,Address_Phone
0,-114.005671,51.262567,"Walmart Supercentre; #1050,","2881 Main St SW,Airdrie ,AB T4B 3G5,(403) 945-..."
1,-111.900542,50.577939,"Walmart Supercentre; #3658,","917 3rd St W,Brooks ,AB T1R 1L5,(403) 793-2111"
2,-114.039133,51.107253,"Walmart Supercentre; #3013,","1110 57th Ave NE,Calgary ,(NOP),AB T2E 9B7,(40..."
3,-114.138488,51.040871,"Walmart Supercentre; #3009,Gas,","1212 37 St SW,Calgary ,(NOP),AB T3C 1S3,(403) ..."
4,-114.028603,50.930551,"Walmart; #1144,","1221 Canyon Meadows Dr SE,Calgary ,AB T2J 6G2,..."


## Cleaning up the store information.

As hinted at above, the `"` characters mean that two of the columns (one containing the store type/number, the other the address/phone number) each hold several comma-separated values combined into a single field.  This was done because these entries do not all have the same number of values.  For example, the store type/number column occasionally includes `Gas`.
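To see the role of the quotes concretely, here is a minimal sketch on two lines reconstructed from the output shown earlier (the exact raw file contents are an assumption). It compares a naive split on commas with what `pd.read_csv` does when it honors the default `quotechar='"'`.

```python
import io
import pandas as pd

# Two lines reconstructed from the displayed rows; assumed to mirror the raw file.
raw = (
    '-114.005671,51.262567,"Walmart Supercentre; #1050,",'
    '"2881 Main St SW,Airdrie ,AB T4B 3G5,(403) 945-1295"\n'
    '-111.900542,50.577939,"Walmart Supercentre; #3658,",'
    '"917 3rd St W,Brooks ,AB T1R 1L5,(403) 793-2111"\n'
)

# A naive split ignores the quotes and breaks the line at every comma.
naive_field_count = len(raw.splitlines()[0].split(","))

# read_csv treats each quoted span as one field, commas included.
quoted = pd.read_csv(
    io.StringIO(raw),
    names=["Lat", "Long", "Description", "Address_Phone"],
)
print(naive_field_count, quoted.shape)
```

The quoted parse keeps a fixed four columns no matter how many commas sit inside a field, which is why the quotes are left in place at this stage.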

In this part of the activity, you should apply the iterative OpenRefine approach to separate the information in the store column.

**Warning!** There is one entry that doesn't follow the same pattern as the rest.  You won't find this entry unless you carefully define/fix/eliminate patterns.
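The iterative idea can be sketched in plain pandas: keep a list of patterns you already understand, filter out every row that matches one, and inspect whatever remains. This is only an illustration, not the `more_dfply` workflow used in the cells that follow. The sample descriptions are taken from the outputs in this activity, except the last one, which is a made-up stand-in for an entry that breaks the pattern.

```python
import pandas as pd

descriptions = pd.Series([
    "Walmart Supercentre; #1050,",
    "Walmart Supercentre; #3009,Gas,",
    "Walmart Supercenter, #1436,",
    "Murphy: USA, #6861,Gas/Diesel,",
    "Sam's Club, #4820,Gas/Diesel,",
    "Wm Nbrhd Mkt, #5205,",
    "Store with no number,",  # hypothetical odd entry, not from the data
])

# Patterns that are already understood; add to this list as you go.
explained = [
    r"Walmart(?: .*)?[;,]\s?#\d{4}",
    r"(?:Murphy|Wm|Sam's)",
]

# Remove every row matched by a known pattern; the remainder needs attention.
unexplained = descriptions
for pattern in explained:
    unexplained = unexplained[~unexplained.str.contains(pattern, regex=True)]

print(unexplained.tolist())
```

When the remainder is empty, every row is covered by at least one pattern and you can move on to the transform step.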

In [28]:
from more_dfply import case_when, ifelse
from more_dfply.facets import text_facet, text_filter

In [49]:
# View cell: filter out rows matching each known pattern; anything left still needs a fix
(walmart
 >> select(X.Description)
 >> filter_by(~text_filter(X.Description, 'Gas/Diesel', regex=True))
 >> filter_by(~text_filter(X.Description, 'Gas', regex=True))
 # Raw strings so that \s and \d reach the regex engine unchanged
 >> filter_by(~text_filter(X.Description, r'Walmart(?: .*)?[;,]\s?#\d{4}', regex=True))
 >> filter_by(~text_filter(X.Description, r"(?:Murphy|Wm|Sam's)", regex=True))
 >> filter_by(~text_filter(X.Description, "; Supercenter", regex=True))
)



Unnamed: 0,Description


In [50]:
# Transform cell
(walmart
    # Drop the stray ';' from entries that contain " ; Supercenter"
    >> mutate(Description = ifelse(X.Description.str.contains(" ; Supercenter"),
                                  X.Description.str.replace(';',''), X.Description))
    # Normalize the remaining ';' separators to ','
    >> mutate(Description = X.Description.str.replace(";",","))
    # Split on ',' into store type, store number, and the optional gas field
    >> mutate(Store_type = X.Description.str.split(',').str.get(0),
              Store_number = X.Description.str.split(',').str.get(1),
              Gas = X.Description.str.split(',').str.get(2))
    >> sample(10)
)

Unnamed: 0,Lat,Long,Description,Address_Phone,Store_type,Store_number,Gas
1486,-84.593536,30.557852,"Murphy: USA, #6861,Gas/Diesel,","1880 Pat Thomas Pkwy,Quincy,FL,32351 ,,(850) 6...",Murphy: USA,#6861,Gas/Diesel
1797,-81.45212,32.139018,"Sam's Club, #4820,Gas/Diesel,","15 Mill Creek Circle; I-95 Exit 104,Pooler,GA,...",Sam's Club,#4820,Gas/Diesel
180,-86.658454,34.736543,"Sam's Club, #4776,Gas,","5651 Holmes AvE NW,Huntsville,AL,35816 ,,(256)...",Sam's Club,#4776,Gas
2846,-70.962144,42.448564,"Walmart, #2139,","780 Lynnway,Lynn,MA,01905 ,(NOP),(781) 592-4300",Walmart,#2139,
5564,-97.754402,30.220294,"Walmart Supercenter, #1253,Gas/Diesel,","710 E Ben White Blvd; I-35 Exit 230,Austin,TX,...",Walmart Supercenter,#1253,Gas/Diesel
5629,-95.855299,32.545503,"Murphy: USA, #5712,Gas/Diesel,","601 E Hwy 243,Canton,TX,75103 ,,(903) 567-0946",Murphy: USA,#5712,Gas/Diesel
1221,-80.142424,26.650277,"Walmart Supercenter, #1436,","6294 Forest Hill Blvd,Greenacres,FL,33415 ,(NO...",Walmart Supercenter,#1436,
6370,-111.940413,41.10042,"Wm Nbrhd Mkt, #5205,","1356 E Hwy 193,Layton,UT,84040 ,(NOP),(801) 77...",Wm Nbrhd Mkt,#5205,
1225,-81.639008,28.125235,"Murphy: USA, #5659,Gas,","36115 Hwy 27,Haines City,FL,33844 ,,(863) 421-...",Murphy: USA,#5659,Gas
1861,-83.672122,32.619499,"Walmart Supercenter, #1367,","2720 Watson Blvd,Warner Robins,GA,31093 ,,(478...",Walmart Supercenter,#1367,
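As a possible variation on the transform cell, plain pandas can split once with `expand=True` and name the resulting columns, rather than calling `str.split` three times. This is a sketch on three descriptions copied from the outputs above, not a replacement for the cell itself; the `strip` step is an added assumption to remove any spaces left around each piece.

```python
import pandas as pd

sample = pd.DataFrame({"Description": [
    "Murphy: USA, #6861,Gas/Diesel,",
    "Walmart, #2139,",
    "Walmart Supercentre; #1050,",
]})

parts = (
    sample["Description"]
    .str.replace(";", ",", regex=False)  # treat ';' as another separator
    .str.split(",", expand=True)         # one column per comma-separated piece
    .iloc[:, :3]                         # keep type, number, and optional gas
)
parts.columns = ["Store_type", "Store_number", "Gas"]

# Trim whitespace left over from separators such as ", #6861".
parts = parts.apply(lambda col: col.str.strip())
print(parts)
```

Rows without a gas entry end up with an empty string in `Gas`, because the trailing comma in each description produces an empty final piece.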


## Preview of Coming Attractions

In this module's homework assignment, you will continue to clean up this data set.