# Wrangling the Data

Real-world data has problems. The most obvious, which we have seen several times so far, is missing or incomplete data. Invalid or inaccurate data is somewhat less common and much harder to detect (as we have discussed before, we need to approach rejecting data with great caution).

We will take this class to discuss a broad approach to data and cover some common problems that come up for various types of data. We will end with an introduction to dealing with strings and images, two types of data that Python provides powerful tools for working with.

## Formatting the Data

Putting aside the issue of Big Data (data so large that it cannot be loaded into a pandas.DataFrame all at once), the main goal in formatting the data is to set it up so that it can be represented by a pandas.DataFrame structure. Most of the examples we have seen in class have been delivered as CSV files that are ready to be read into a dataframe. The primary exceptions so far have been the Berlin Airbnb data, which arrived as multiple cross-referenced CSV files, and the image recognition example, whose data was a collection of JPG files.

However, I would like to build a new exception. Let's consider the collected works of Shakespeare, available here: [Works of Shakespeare](http://shakespeare.mit.edu/), and here as a single text file: [Works of Shakespeare](https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt). This also means that we will be dealing with strings first, which is fine, as we have already seen examples of dealing with Numerical Data and Categorical Data. You can also get a zip file with the works as separate txt files here: [Fermibot](https://fermibot.com/analysis-of-shakespeares-work-using-python-3/).

We will go through this as an example of what one needs to do when formatting data and dealing with strings, but of course what happens in any specific example will be different. The best I can recommend is that you work through as many examples as you can, so that you have as broad an experience as possible in solving the problems that come up.

In [61]:
# Reading in a text file line by line.
# The with block closes the file once the lines have been read.

with open('Data Sets/t8.shakespeare.txt') as f:
    f1 = f.readlines()

In [62]:
# f1 is a list of each line of the file as a string.
# The file has a lot of stuff in addition to Shakespeare's Sonnets and Plays

f1[0:5]

['This is the 100th Etext file presented by Project Gutenberg, and\n',
 'is presented in cooperation with World Library, Inc., from their\n',
 'Library of the Future and Shakespeare CDROMS.  Project Gutenberg\n',
 'often releases Etexts that are NOT placed in the Public Domain!!\n',
 '\n']
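
Each entry in `f1` still carries its trailing newline, and many entries are blank lines. A minimal sketch of one way to tidy such a list, run on a short hand-copied sample rather than the full file (the helper name `clean_lines` is hypothetical and not part of the notebook above):

```python
def clean_lines(lines):
    """Strip the trailing newline from each line and drop blank lines."""
    cleaned = []
    for line in lines:
        text = line.rstrip('\n')   # remove only the newline, keep indentation
        if text.strip():           # skip lines that are empty or only whitespace
            cleaned.append(text)
    return cleaned

# A small sample in the same shape as f1[0:5]
sample = [
    'This is the 100th Etext file presented by Project Gutenberg, and\n',
    'is presented in cooperation with World Library, Inc., from their\n',
    '\n',
]

cleaned_sample = clean_lines(sample)
print(cleaned_sample)
```

Using `rstrip('\n')` rather than `strip()` keeps any leading indentation, which may carry meaning in verse text.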

In [63]:
# Finding the start of the actual works involves searching for the first instance of the string
# 'by William Shakespeare\n' (more or less; there are a bunch of extra formatting characters
# like numbers and stage directions)

# Or in the case of the plays searching for 'ACT I. SCENE 1.\n'

start = f1.index('by William Shakespeare\n')+1
start

249

In [64]:
# The end of an individual work is given by 'THE END\n'

end = f1.index('THE END\n')
end

2869

In [65]:
# Pulling out the Sonnets
ws = f1[start:end]

# Removing them from f1
f1 = f1[end+1:]

ws[2600:-1]

["    Where Cupid got new fire; my mistress' eyes.\n",
 '\n',
 '\n',
 '                     154\n',
 '  The little Love-god lying once asleep,\n',
 '  Laid by his side his heart-inflaming brand,\n',
 '  Whilst many nymphs that vowed chaste life to keep,\n',
 '  Came tripping by, but in her maiden hand,\n',
 '  The fairest votary took up that fire,\n',
 '  Which many legions of true hearts had warmed,\n',
 '  And so the general of hot desire,\n',
 '  Was sleeping by a virgin hand disarmed.\n',
 '  This brand she quenched in a cool well by,\n',
 "  Which from Love's fire took heat perpetual,\n",
 '  Growing a bath and healthful remedy,\n',
 "  For men discased, but I my mistress' thrall,\n",
 '    Came there for cure and this by that I prove,  \n',
 "    Love's fire heats water, water cools not love.\n",
 '\n']
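
The output above suggests that each sonnet is introduced by a line holding only its number. Assuming that pattern holds throughout (it has only been checked here against the lines shown above), a rough sketch of grouping the lines by sonnet and flattening them into records that could later become rows of a dataframe might look like this. The helper name `split_sonnets` and the record field names are hypothetical:

```python
def split_sonnets(lines):
    """Group lines into {sonnet_number: [verse lines]}.

    Assumes a sonnet starts at a line that holds only a number.
    """
    sonnets = {}
    current = None
    for line in lines:
        text = line.strip()
        if text.isdigit():
            # A number on its own line starts a new sonnet
            current = int(text)
            sonnets[current] = []
        elif text and current is not None:
            # Non-blank line belonging to the sonnet currently being read
            sonnets[current].append(text)
    return sonnets

# A few lines copied from the end of ws
sample = [
    "    Where Cupid got new fire; my mistress' eyes.\n",
    '\n',
    '\n',
    '                     154\n',
    '  The little Love-god lying once asleep,\n',
    '  Laid by his side his heart-inflaming brand,\n',
    '\n',
]

sonnets = split_sonnets(sample)

# One record per verse line: a shape that maps directly onto table rows
records = [
    {'sonnet': number, 'line_no': i + 1, 'text': text}
    for number, verses in sonnets.items()
    for i, text in enumerate(verses)
]
print(records)
```

A list of dictionaries like `records` can be passed to `pandas.DataFrame(records)` to get one row per verse line. Note that lines appearing before the first number (here the tail of the previous sonnet) are skipped by this sketch.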

In [69]:
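# Going through the whole collection and stripping out just the works of Shakespeare.
# Note: list.extend modifies ws in place and returns None, so its result must not be
# assigned back to ws (doing so raised
# AttributeError: 'NoneType' object has no attribute 'extend' on the next pass).

while 'by William Shakespeare\n' in f1:
    start = f1.index('ACT I. SCENE 1.\n')
    end = f1.index('THE END\n')
    if start > end:
        # The next work does not use this exact scene heading: stop and inspect f1
        break
    ws.extend(f1[start:end])
    f1 = f1[end+1:]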

In [70]:
f1.index('ACT I. SCENE 1.\n')

33074

In [71]:
f1.index('THE END\n')

4190

The next 'THE END\n' (index 4190) comes before the next 'ACT I. SCENE 1.\n' (index 33074), so the text up to that 'THE END\n' does not contain this exact scene heading. A single fixed scene heading will therefore not work as the start marker for every play.

In [72]:
ws
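
One possible way around the marker problem is to search for each end marker only after the start marker that precedes it, and to move a search position forward instead of slicing `f1` on every pass. The sketch below is an illustration on a tiny made-up list, not a tested solution for the full file: the helper name `extract_works` is hypothetical, and it assumes every work is introduced by the exact line 'by William Shakespeare\n', which the comments above note is only "more or less" true.

```python
def extract_works(lines, start_marker, end_marker):
    """Return a list of works, each a list of the lines found between one
    start_marker and the first end_marker that follows it."""
    works = []
    pos = 0
    while True:
        try:
            # list.index accepts a position to start searching from
            start = lines.index(start_marker, pos) + 1
            end = lines.index(end_marker, start)
        except ValueError:
            # No further start marker, or no end marker after it
            break
        works.append(lines[start:end])
        pos = end + 1
    return works

# Made-up stand-in for f1, with two "works" separated by other text
sample = [
    'front matter\n',
    'by William Shakespeare\n',
    'sonnet line\n',
    'THE END\n',
    'license text\n',
    'by William Shakespeare\n',
    'ACT I. SCENE I.\n',
    'play line\n',
    'THE END\n',
]

works = extract_works(sample, 'by William Shakespeare\n', 'THE END\n')
print(len(works))
```

Because the end marker is always searched for from `start` onward, `start` can never land after `end`, and the original list is left unchanged so it can be inspected again if the result looks wrong.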


## Dealing with Strings

## Dealing with Numerical Data

## Dealing with Categorical Data

## Dealing with Missing Data

## Dealing with Images