The class was given a survey and asked the question:
```
What classes are you taking?
```

This question looks simple on the surface, but it hides complexity that a data scientist should learn to spot.
Let's look at the responses:

In [None]:
import matplotlib.pyplot
import wordcloud

courses = []

with open('responses.txt', 'r') as file:
    for line in file:
        courses.append(line.strip())

courses[0:15]

Let's write a function to help us visualize the data.

In [None]:
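def visualize(words):
    """Render a word cloud built from a list of words."""
    # Join with tabs so the word cloud library sees one block of text.
    cloud = wordcloud.WordCloud().generate('\t'.join(words))

    matplotlib.pyplot.figure()
    matplotlib.pyplot.imshow(cloud, interpolation = "bilinear")
    matplotlib.pyplot.axis("off")
    matplotlib.pyplot.show()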

In [None]:
visualize(courses)

At first glance, this looks really cool!

But if we look a little deeper, we start to see some problems...
 - "CSE" is on its own. There is no course just called CSE.
 - There are multiple instances of "CSE40" floating around.

What's going on?
Let's look closer at the data.

In [None]:
# Only look at the unique values:
list(sorted(set(courses)))[0:15]

Looks like there may be more than one course per line.
Maybe the word cloud library was having trouble with those.
Let's split those up.

In [None]:
new_courses = []

for line in courses:
    split_line = line.split(';')
    new_courses += split_line

courses = new_courses
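
To see what the split does to individual lines, here is a minimal sketch on a few made-up responses (the `sample` values are illustrative and not taken from `responses.txt`).
Note that an empty line survives the split as an empty string.

```python
# Hypothetical sample responses, used only for illustration.
sample = ['CSE40;CSE12', 'CSE 115C', '']

split_sample = []
for line in sample:
    # extend() adds each piece of the split line as its own entry.
    split_sample.extend(line.split(';'))

print(split_sample)
# -> ['CSE40', 'CSE12', 'CSE 115C', '']
```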

Now let's look at our courses again.

In [None]:
print(list(sorted(set(courses)))[0:15])
visualize(courses)

Looking better, but we can still see some mistakes:
'CSE' still appears on its own in the word cloud.

If we look at the raw words, it looks like some people entered "CSE 115C" instead of "CSE115C".
In fact, there are several more spacing issues in our words, like ' CSE114A' and 'CSE114A'.
We can clean this up by just removing all whitespace from our words.

In [None]:
courses = [course.replace(' ', '') for course in courses]

# If we wanted to be really cool, we would use a regular expression instead.
# (str.replace() only matches literal text, so this needs re.sub().)
# import re
# courses = [re.sub(r'\s+', '', course) for course in courses]
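
One detail worth knowing: `str.replace()` only matches literal text, so removing `' '` catches spaces but not tabs or other whitespace.
A regular expression version needs `re.sub()`. Here is a small sketch on made-up values (the `sample` list is an assumption for illustration):

```python
import re

# Hypothetical values with a leading space, an inner space, and a tab.
sample = [' CSE114A', 'CSE 115C', 'CSE\t12 ']

# \s+ matches any run of whitespace (spaces, tabs, newlines).
cleaned = [re.sub(r'\s+', '', course) for course in sample]

print(cleaned)
# -> ['CSE114A', 'CSE115C', 'CSE12']
```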

Let's try again.

In [None]:
print(list(sorted(set(courses)))[0:15])
print(list(sorted(set(courses)))[40:50])
visualize(courses)

Much cleaner!

But if we look at the raw words, we see that there is some inconsistency with casing.
Let's convert them all to the same case.

In [None]:
courses = [course.upper() for course in courses]

Visualize.

In [None]:
print(list(sorted(set(courses)))[0:15])
print(list(sorted(set(courses)))[40:50])
visualize(courses)

Fantastic!

But, there is still one issue:
What about that empty value at the start of our sorted list?
Why is it there?
Is it a valid value?

It may be a valid value...
What if there are students not taking any other classes and they didn't put "CSE40"?
We may want to know that, but empty strings don't show up in the word cloud.

To fix this, we should replace the empty value with a known value.
Let's use: '__NONE__'.

In [None]:
courses = ['__NONE__' if course == '' else course for course in courses]
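
If we want a number rather than a picture, one option is to count the values directly.
This is a minimal sketch using `collections.Counter` on made-up data (the `sample` list is hypothetical):

```python
from collections import Counter

# Hypothetical cleaned values, including two empty responses.
sample = ['CSE12', '', 'CSE40', '', 'CSE12']
sample = ['__NONE__' if course == '' else course for course in sample]

# Counter maps each distinct value to how many times it appears.
counts = Counter(sample)
print(counts['__NONE__'])
# -> 2
```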

Now to visualize the results.

In [None]:
print(list(set(courses))[0:15])
visualize(courses)

Finally, we can see people who are not taking other classes.

But, handling this case raises another question:
Should we include "CSE40" in the results?
The question we asked was:
```
What classes are you taking?
```
But it looks like some people interpreted it as:
```
What classes are you taking aside from CSE40?
```

In this case, it looks like we have an issue with our data that stems all the way from how we collected the data.
We may not have been clear enough and this caused our data to come out noisier than we would like.

To help clean things up, we can just remove all instances of CSE40.

In [None]:
# Since we already made cases consistent and removed white space, we can just do a direct comparison.
courses = [course for course in courses if (course != 'CSE40')]

In [None]:
print(list(set(courses))[0:15])
visualize(courses)

Now we can see an accurate representation of the other courses people in CSE40 are taking.
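
One caveat that may be worth checking: with this approach, a student who answered only "CSE40" is dropped entirely instead of being counted as taking no other classes.
A possible alternative is to clean each response on its own before flattening. The sketch below uses made-up responses and a hypothetical helper, `other_courses`, that is not part of the steps above:

```python
def other_courses(response, own_course='CSE40'):
    """Clean one raw response and drop the surveyed class itself.

    Hypothetical helper for illustration only.
    """
    cleaned = [piece.replace(' ', '').upper() for piece in response.split(';')]
    remaining = [course for course in cleaned if course not in ('', own_course)]
    # Nothing left means the student is taking no other classes.
    return remaining if remaining else ['__NONE__']

# Hypothetical raw responses.
sample = ['cse40', 'CSE40; CSE 12', '', 'cse12;CSE 30']
per_response = [other_courses(line) for line in sample]

print(per_response)
# -> [['__NONE__'], ['CSE12'], ['__NONE__'], ['CSE12', 'CSE30']]
```

Whether this is the right call depends on how we decide to interpret the survey question.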

At the beginning of this process, we may have thought that our data was already in a pretty good state.
But by working with it and exploring it, we have seen how much room for improvement our data had.
Some changes were simple ones that just required some manipulation,
but some changes required asking ourselves deep questions about how the data was collected in the first place.
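
To recap, the cleaning steps from this walkthrough can be gathered into one function.
This is only a sketch: the name `clean_courses` and the sample input are assumptions for illustration, and the steps run in the same order as above.

```python
def clean_courses(lines):
    """Apply the cleaning steps from this walkthrough, in order."""
    # 1. Split lines that hold more than one course.
    courses = []
    for line in lines:
        courses += line.split(';')

    # 2. Remove spaces and make casing consistent.
    courses = [course.replace(' ', '').upper() for course in courses]

    # 3. Give empty responses a visible placeholder.
    courses = ['__NONE__' if course == '' else course for course in courses]

    # 4. Drop the class that everyone surveyed is taking.
    return [course for course in courses if course != 'CSE40']

# Hypothetical raw responses.
sample = ['CSE40; cse 12', '', ' CSE114A']
result = clean_courses(sample)
print(result)
# -> ['CSE12', '__NONE__', 'CSE114A']
```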