# Python-Data-Cleaning

### 1. Load in the data

The code to load in the data is provided below. 

The `with open(..., 'r') as f:` statement opens the file in "read" mode (rather than "write") and assigns the open file object to `f`. The file is closed automatically when the `with` block ends.

We can then call the file object's `.readlines()` method, which returns a list with one string per line of the CSV file, and assign that list to the variable `lines`.

In [1]:
file = "coffee-preferences.csv"
with open(file,'r') as f:
    lines = f.readlines()
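
As a minimal sketch of what `.readlines()` returns, the snippet below reads from an in-memory `io.StringIO` object instead of the real file. The names `sample_text` and `sample_lines` and the shortened two-column layout are assumptions made for illustration only.

```python
import io

# In-memory stand-in for the CSV file (assumption: a shortened version of the real layout)
sample_text = "Timestamp,Name,Starbucks\n3/17/2015 18:37:58,Alison,3\n"

with io.StringIO(sample_text) as f:
    sample_lines = f.readlines()

# Each element is one line of the input, still ending in '\n'
print(sample_lines)
```

Note that every string keeps its trailing newline character, which is why a later cleanup step is needed.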

### 2. Examine the `lines` object

Print out `lines`. (If you run a cell containing just `lines`, with no `print` statement, the notebook formats the output more readably.)

What can we see right off the bat about the data?

In [2]:
lines

# 1. There are still '\n' newline characters that were not removed by .readlines()
# 2. Many people have missing ratings for different coffee brands.

['Timestamp,Name,Starbucks,PhilzCoffee,BlueBottleCoffee,PeetsTea,CaffeTrieste,GrandCoffee,RitualCoffee,FourBarrel,WorkshopCafe\n',
 '3/17/2015 18:37:58,Alison,3,5,4,3,,,5,5,\n',
 '3/17/2015 18:38:09,April,4,5,5,3,,,3,,5\n',
 '3/17/2015 18:38:25,Vijay,3,5,5,5,3,2,1,1,1\n',
 '3/17/2015 18:38:28,Vanessa,1,5,5,2,,,3,2,3\n',
 '3/17/2015 18:38:46,Isabel,1,4,4,2,4,,4,4,\n',
 '3/17/2015 18:39:01,India,5,3,3,3,3,1,,,3\n',
 '3/17/2015 18:39:01,Dave H,4,5,,5,,,,,\n',
 '3/17/2015 18:39:05,Deepthi,3,5,,2,,,,,2\n',
 '3/17/2015 18:39:14,Ramesh,3,4,,3,,,,,4\n',
 '3/17/2015 18:39:23,Hugh Jass,1,5,5,4,5,2,5,4,1\n',
 '3/17/2015 18:39:23,Alex,4,5,,3,,,,,\n',
 '3/17/2015 18:39:30,Ajay Anand,3,4,4,3,5,,,,\n',
 '3/17/2015 18:39:35,David Feng,2,3,4,2,2,,5,4,3\n',
 '3/17/2015 18:39:42,Zach,3,4,4,3,,,,,5\n',
 '3/17/2015 18:40:44,Matt,3,5,4,3,2,2,4,3,2\n',
 '3/17/2015 18:40:49,Markus,3,5,,3,,,4,,\n',
 '3/17/2015 18:41:18,Otto,4,2,2,5,,,3,3,3\n',
 '3/17/2015 18:41:23,Alessandro,1,5,3,2,,,4,3,\n',
 '3/17/2015 18:4

---

### 3. Remove the remaining newline `'\n'` characters with a for-loop

Iterate through the lines of the data and remove the unwanted newline characters.

**`.replace('\n', '')`** is a built-in string method. Its first argument is the substring you want to replace, and its second argument is the string you want to replace it with.


In [3]:
cleaned_lines = []
for line in lines:
    cleaned_lines.append(line.replace('\n', ''))
cleaned_lines

['Timestamp,Name,Starbucks,PhilzCoffee,BlueBottleCoffee,PeetsTea,CaffeTrieste,GrandCoffee,RitualCoffee,FourBarrel,WorkshopCafe',
 '3/17/2015 18:37:58,Alison,3,5,4,3,,,5,5,',
 '3/17/2015 18:38:09,April,4,5,5,3,,,3,,5',
 '3/17/2015 18:38:25,Vijay,3,5,5,5,3,2,1,1,1',
 '3/17/2015 18:38:28,Vanessa,1,5,5,2,,,3,2,3',
 '3/17/2015 18:38:46,Isabel,1,4,4,2,4,,4,4,',
 '3/17/2015 18:39:01,India,5,3,3,3,3,1,,,3',
 '3/17/2015 18:39:01,Dave H,4,5,,5,,,,,',
 '3/17/2015 18:39:05,Deepthi,3,5,,2,,,,,2',
 '3/17/2015 18:39:14,Ramesh,3,4,,3,,,,,4',
 '3/17/2015 18:39:23,Hugh Jass,1,5,5,4,5,2,5,4,1',
 '3/17/2015 18:39:23,Alex,4,5,,3,,,,,',
 '3/17/2015 18:39:30,Ajay Anand,3,4,4,3,5,,,,',
 '3/17/2015 18:39:35,David Feng,2,3,4,2,2,,5,4,3',
 '3/17/2015 18:39:42,Zach,3,4,4,3,,,,,5',
 '3/17/2015 18:40:44,Matt,3,5,4,3,2,2,4,3,2',
 '3/17/2015 18:40:49,Markus,3,5,,3,,,4,,',
 '3/17/2015 18:41:18,Otto,4,2,2,5,,,3,3,3',
 '3/17/2015 18:41:23,Alessandro,1,5,3,2,,,4,3,',
 '3/17/2015 18:41:35,Rocky,3,5,4,3,3,3,4,4,3',
 '3/17/
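
The same cleanup can also be written as a list comprehension. The sketch below uses `.rstrip('\n')`, which strips newline characters only from the end of each string, whereas `.replace('\n', '')` removes them anywhere in the string. For data like this the two give the same result. The list `raw_lines` is a made-up sample for illustration.

```python
# Sample input lines (assumption: shortened rows in the same style as the real data)
raw_lines = [
    'Timestamp,Name,Starbucks\n',
    '3/17/2015 18:37:58,Alison,3\n',
]

# Strip the trailing newline from every line in one expression
stripped_lines = [line.rstrip('\n') for line in raw_lines]
print(stripped_lines)
```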

---

### 4. Split the lines into "header" and "data" variables

The header is the first string in the list of strings. It contains the column names of our data.

In [4]:
header = cleaned_lines[0]
data = cleaned_lines[1:]

In [5]:
header

'Timestamp,Name,Starbucks,PhilzCoffee,BlueBottleCoffee,PeetsTea,CaffeTrieste,GrandCoffee,RitualCoffee,FourBarrel,WorkshopCafe'

In [6]:
data[:5]

['3/17/2015 18:37:58,Alison,3,5,4,3,,,5,5,',
 '3/17/2015 18:38:09,April,4,5,5,3,,,3,,5',
 '3/17/2015 18:38:25,Vijay,3,5,5,5,3,2,1,1,1',
 '3/17/2015 18:38:28,Vanessa,1,5,5,2,,,3,2,3',
 '3/17/2015 18:38:46,Isabel,1,4,4,2,4,,4,4,']

---

### 5. Split the header and the data strings on commas

To split a string on the comma character, use the **`.split(',')`** string method.

Split the header on commas first and print it. You can see that the original string is now a list, with items that were originally separated by commas.

In [11]:
# split the header string on commas to get a list of column names
header = header.split(',')
print(header)

# split each data row on commas as well
split_data = []
for r in data:
    split_data.append(r.split(','))

['Timestamp', 'Name', 'Starbucks', 'PhilzCoffee', 'BlueBottleCoffee', 'PeetsTea', 'CaffeTrieste', 'GrandCoffee', 'RitualCoffee', 'FourBarrel', 'WorkshopCafe']


In [10]:
split_data

[['3/17/2015 18:37:58', 'Alison', '3', '5', '4', '3', '', '', '5', '5', ''],
 ['3/17/2015 18:38:09', 'April', '4', '5', '5', '3', '', '', '3', '', '5'],
 ['3/17/2015 18:38:25', 'Vijay', '3', '5', '5', '5', '3', '2', '1', '1', '1'],
 ['3/17/2015 18:38:28', 'Vanessa', '1', '5', '5', '2', '', '', '3', '2', '3'],
 ['3/17/2015 18:38:46', 'Isabel', '1', '4', '4', '2', '4', '', '4', '4', ''],
 ['3/17/2015 18:39:01', 'India', '5', '3', '3', '3', '3', '1', '', '', '3'],
 ['3/17/2015 18:39:01', 'Dave H', '4', '5', '', '5', '', '', '', '', ''],
 ['3/17/2015 18:39:05', 'Deepthi', '3', '5', '', '2', '', '', '', '', '2'],
 ['3/17/2015 18:39:14', 'Ramesh', '3', '4', '', '3', '', '', '', '', '4'],
 ['3/17/2015 18:39:23',
  'Hugh Jass',
  '1',
  '5',
  '5',
  '4',
  '5',
  '2',
  '5',
  '4',
  '1'],
 ['3/17/2015 18:39:23', 'Alex', '4', '5', '', '3', '', '', '', '', ''],
 ['3/17/2015 18:39:30', 'Ajay Anand', '3', '4', '4', '3', '5', '', '', '', ''],
 ['3/17/2015 18:39:35',
  'David Feng',
  '2',
  '3',
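
Splitting on commas works for this file because no field contains a comma itself. As a hedged alternative that is not part of the original exercise, Python's standard `csv` module also handles quoted fields that contain commas. The sample rows below, including the quoted name, are made up for illustration.

```python
import csv

# Made-up sample rows; the second one has a quoted field containing a comma
sample_rows = [
    'Timestamp,Name,Starbucks',
    '3/17/2015 18:37:58,"Doe, Jane",3',
]

# csv.reader accepts any iterable of strings and yields one list of fields per row
parsed_rows = list(csv.reader(sample_rows))

# A plain split breaks the quoted name into two separate fields
naive_row = sample_rows[1].split(',')

print(parsed_rows[1])
print(naive_row)
```

For simple, unquoted data like the coffee ratings, `.split(',')` is enough and keeps the exercise focused on basic string and list operations.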


---

### 6. Remove the "Timestamp" column

We aren't interested in the Timestamp column in our data, so remove it from the header and the data list.

Removing the Timestamp from the header is simple and can be done with list methods or slicing. To remove the Timestamp column from each row of the data, use a for-loop.

Print out the new data object with the timestamps removed.

In [None]:
# remove Timestamp:
header = header[1:]
data_nots = []
for row in split_data:
    data_nots.append(row[1:])

In [None]:
header

In [None]:
data_nots
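
As a small self-contained sketch of the slicing idea, the snippet below drops the first column from a made-up header and two made-up rows. The names `sample_header`, `sample_split`, `header_no_ts`, and `rows_no_ts` are illustrative assumptions, not variables from the exercise.

```python
# Made-up sample in the same shape as the split header and data
sample_header = ['Timestamp', 'Name', 'Starbucks']
sample_split = [
    ['3/17/2015 18:37:58', 'Alison', '3'],
    ['3/17/2015 18:38:09', 'April', '4'],
]

# Slicing from index 1 keeps everything except the first item (the timestamp)
header_no_ts = sample_header[1:]
rows_no_ts = [row[1:] for row in sample_split]

print(header_no_ts)
print(rows_no_ts)
```

Slicing returns a new list, so the original rows are left unchanged.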

---

### 7. Convert numeric columns to floats and empty fields to None

Iterate through the data and construct a new data list of lists that has the numeric ratings converted from strings to floats, and the empty fields (which are empty strings `''`) replaced with the `None` object.

Use a nested for-loop (for-loop within another for-loop) to get the job done. You will likely need to use if/else conditional statements as well!

Print out the new data object to make sure it worked.


In [None]:
mylist = ['cat', 'dog', 'mouse']
cities = ['LA', 'SM', 'Venice']
new_list = [[mylist] + [cities]]
new_list
#zip(mylist, cities)

In [None]:
# convert numeric columns to floats, empty fields to None
data_num = []
for row in data_nots:
    new_row = []
    for i, col in enumerate(row):
        if i == 0:
            new_row.append(col)
        else:
            if col == '':
                new_row.append(None)
            else:
                new_row.append(float(col))
    data_num.append(new_row)

In [None]:
data_num
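
One way to keep the nested loop readable is to move the per-field decision into a small helper. The sketch below assumes a hypothetical function named `to_number` and a made-up sample row; it is an illustration of the same if/else logic, not a required part of the solution.

```python
def to_number(field):
    """Return None for an empty string, otherwise the field converted to a float."""
    if field == '':
        return None
    return float(field)

# Made-up row: a name followed by rating strings, some of them empty
sample_row = ['Alison', '3', '5', '', '']

# Keep the name as-is and convert every remaining field
converted_row = [sample_row[0]] + [to_number(field) for field in sample_row[1:]]
print(converted_row)
```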

---

### 8. Count the `None` values per person

Use a for-loop to count the number of `None` values per person. Create a dictionary with the names of the people as keys, and the counts of `None` as values.

Who rated the most coffee brands? Who rated the fewest?

In [None]:
user_nones = {}
for row in data_num:
    nones = 0
    for cell in row:
        if cell is None:
            nones += 1
    user_nones[row[0]] = nones
    
user_nones

# Least: Alex, Dave H, cheong-tseng eng
# Most: Hugh Jass, Matt, Rocky, Vijay
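
As a hedged sketch of how the dictionary can be queried, the snippet below builds the counts with `list.count` and then looks up the names with the smallest and largest counts. The data in `sample_data` is made up, and `min`/`max` return only the first name when several people tie.

```python
# Made-up rows in the same shape as the converted data: name, then ratings or None
sample_data = [
    ['Alison', 3.0, 5.0, None],
    ['Vijay', 3.0, 5.0, 5.0],
    ['Alex', 4.0, None, None],
]

# Count the None values among each person's ratings
none_counts = {row[0]: row[1:].count(None) for row in sample_data}

# Fewest missing ratings means the most brands rated, and vice versa
rated_most = min(none_counts, key=none_counts.get)
rated_fewest = max(none_counts, key=none_counts.get)

print(none_counts)
print(rated_most, rated_fewest)
```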

---

### 9. Calculate average rating per coffee brand

**Excluding `None` values**, calculate the average rating per brand of coffee.

The final output should be a dictionary with keys as the coffee brand names, and their average rating as the values.

Remember that average can be calculated as the sum of the ratings over the number of ratings:

```python
average_rating = sum(ratings_list)/len(ratings_list)
```

Print your dictionary to see the average brand ratings.

In [None]:
header

In [None]:
brands = header[1:]  # brand names: every header column after "Name"
brands

In [None]:
# which brands of coffee have the highest and lowest average rating, excluding None values?

brand_ratings = {}
for brand in header[1:]:
    brand_ratings[brand] = []

for row in data_num:
    for i, cell in enumerate(row):
        if i > 0 and cell is not None:
            brand_ratings[header[i]].append(cell)

brand_avg_ratings = {}
for brand, ratings in brand_ratings.items():
    #print  brand, ratings
    brand_avg_ratings[brand] = sum(ratings)/len(ratings)

#brand_ratings
brand_avg_ratings
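
The same averages can also be computed column by column. The sketch below uses `zip(*rows)` to turn rows into columns and guards against a brand that nobody rated, which would otherwise cause a division by zero. The header and data are made-up samples, and `avg_by_brand` is an illustrative name.

```python
# Made-up sample in the same shape as the cleaned header and data
sample_header = ['Name', 'Starbucks', 'PhilzCoffee']
sample_data = [
    ['Alison', 3.0, 5.0],
    ['Alex', 4.0, None],
    ['Dave H', None, 5.0],
]

# zip(*rows) transposes rows into columns; drop the first column (the names)
columns = list(zip(*sample_data))[1:]

avg_by_brand = {}
for brand, column in zip(sample_header[1:], columns):
    # Keep only the actual ratings for this brand
    ratings = [value for value in column if value is not None]
    # Store None when a brand has no ratings instead of dividing by zero
    avg_by_brand[brand] = sum(ratings) / len(ratings) if ratings else None

print(avg_by_brand)
```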

---

### 10. Create a list of just the people

This will be a list of the people's names. It can be built with a for-loop.

In [None]:
people = []
for row in data_num:
    people.append(row[0])
    
print(len(people))
people
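
For comparison, the same list can be built with a list comprehension. The sketch below uses made-up rows, and the name `names` is an illustrative assumption.

```python
# Made-up rows in the same shape as the converted data: name first, then ratings
sample_data = [['Alison', 3.0], ['April', 4.0], ['Vijay', 3.0]]

# Take the first item of every row
names = [row[0] for row in sample_data]

print(len(names))
print(names)
```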