# Data loading

The first step in our project is to load the data. It is stored in a file called `train_data.csv` in the `data` directory.

In [1]:
# Tip: It is considered good practice to keep all your imports in the first cell of your notebook. This makes it
# easier to see which libraries the project uses and helps avoid repeated or unnecessary imports. Add all
# your imports in this cell.


In [2]:
# 1. Load the file train_data.csv into a dataframe
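One possible way to approach this step is sketched below, assuming pandas is the dataframe library in use (the text does not name one). The path `data/train_data.csv` is pieced together from the surrounding text, and the toy CSV is a hypothetical stand-in so that the snippet runs without the course files.

```python
import io

import pandas as pd

# Toy stand-in for the real file so the sketch runs anywhere.
# With the course data you would pass a path instead, for example
# pd.read_csv("data/train_data.csv") (path assumed, not confirmed).
toy_csv = io.StringIO(
    "sentiment,text\n"
    "1,I love this song\n"
    "0,my phone died again\n"
    '1,"great day, great people"\n'
)

df = pd.read_csv(toy_csv)

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # column types as inferred by pandas
```

Quoted fields, like the last row above, let a tweet contain commas without being split into extra columns.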


You will notice another file called `test_data.csv`. When training a machine learning model, you need to set aside some data to test the model's performance. This data must not be used for modeling or exploration. We will come back to it later in the course.

In [3]:
# 2. What are the contents of the dataframe? Show the first 10 rows


The `sentiment` column tells you if the tweet is positive (1) or negative (0). `text` is just the tweet's text.

# Data exploration and visualization

Before starting any machine learning modeling, you should _always_ explore your data first. This will not only help you understand the data you are working with, but will also make modeling easier later on. Think about this:

* Simple statistics will surface some of the biases your dataset may have: under-representation of certain groups, skewed distributions, etc.
* Visualising distributions makes it easier to detect outliers or unexpected values. This can reveal, for example, errors during data collection or poor data quality.

Let's start with some simple questions about our Twitter data, just to get to know what we are working with.

In [4]:
# 3. How many tweets does the dataset contain?


In [5]:
# 4. How many positive and negative tweets are there?


In [6]:
# 5. What is the proportion of positive tweets?
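Questions 3 to 5 could be answered along these lines. This is a minimal sketch on a hypothetical toy dataframe, assuming pandas and the `sentiment` encoding described earlier (1 = positive, 0 = negative).

```python
import pandas as pd

# Hypothetical toy data standing in for the real training set.
df = pd.DataFrame({
    "sentiment": [1, 0, 1, 1, 0],
    "text": ["love it", "so tired", "best day ever", "nice one", "ugh monday"],
})

n_tweets = len(df)                              # one row per tweet
class_counts = df["sentiment"].value_counts()   # number of tweets per label
positive_share = (df["sentiment"] == 1).mean()  # fraction of rows labelled 1

print(n_tweets)
print(class_counts)
print(positive_share)
```

Taking the mean of a boolean column gives the proportion of `True` values, which avoids dividing the counts by hand.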


In [7]:
# 6. A picture is worth a thousand words... can you plot the distribution of positives/negatives?
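A bar chart is one reasonable choice for a two-class distribution. The sketch below assumes matplotlib is available and uses a hypothetical toy dataframe; the `Agg` backend line is only there so the snippet also runs outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; not needed inside a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy labels standing in for the real training set.
df = pd.DataFrame({"sentiment": [1, 0, 1, 1, 0]})

# reindex guarantees one bar per class, even if a class has no tweets.
counts = df["sentiment"].value_counts().reindex([0, 1], fill_value=0)

fig, ax = plt.subplots()
ax.bar(["negative (0)", "positive (1)"], counts.to_numpy())
ax.set_ylabel("number of tweets")
ax.set_title("Sentiment distribution")
fig.tight_layout()
```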


In [8]:
# 7. What are your thoughts on the distribution of the data?

Now let's use some domain knowledge to check data quality. We know that tweets cannot be longer than 140 characters. Let's plot the distribution of tweet length. 

In [8]:
# 8. Add a column `text_length_chars` to the dataframe. For each row, this column should tell the length 
# in characters of the tweet in that row


In [9]:
# 9. What are the minimum, maximum and average lengths of a tweet in this dataset?
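Questions 8 and 9 could look roughly like this. The sketch assumes pandas and a hypothetical toy dataframe; only the column name `text_length_chars` comes from the exercise.

```python
import pandas as pd

# Hypothetical toy data standing in for the real training set.
df = pd.DataFrame({
    "sentiment": [1, 0, 1],
    "text": ["love it", "so tired today", "ok"],
})

# .str.len() returns the number of characters in each string
# (and NaN for missing values, if there are any).
df["text_length_chars"] = df["text"].str.len()

# Minimum, maximum and mean in a single call.
summary = df["text_length_chars"].agg(["min", "max", "mean"])
print(summary)
```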


There seems to be something wrong, right? Let's see.

In [10]:
# 10. How many tweets are over 140 characters?


In [11]:
# 11. Those are quite a few... print 10 random ones


In [12]:
# 12. What does the longest one look like?
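Questions 10 to 12 all start from the same boolean filter on the length column. This is a sketch with hypothetical toy data, assuming pandas and a `text_length_chars` column built as in question 8.

```python
import pandas as pd

# Hypothetical toy data: two tweets over the limit, two within it.
df = pd.DataFrame({"text": ["short tweet", "x" * 150, "y" * 141, "z" * 140]})
df["text_length_chars"] = df["text"].str.len()

# Strictly greater than 140, so a tweet of exactly 140 characters is kept out.
too_long = df[df["text_length_chars"] > 140]
print(len(too_long))

# sample() raises an error if n exceeds the number of rows, so cap it.
# random_state makes the random draw repeatable.
print(too_long.sample(n=min(10, len(too_long)), random_state=0))

# idxmax() gives the index label of the longest tweet.
longest = df.loc[df["text_length_chars"].idxmax(), "text"]
print(longest)
```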


This is an example of a data collection/formatting error. The CSV file was supposed to have only two columns, `sentiment` and `text`, but some rows contain an extra column, `Sentiment140`. This is most probably a mistake made when building the dataset. A quick Google search tells us that `Sentiment140` is [a tool for Twitter sentiment analysis](http://help.sentiment140.com/home). Probably, the dataset was partly collected there.

The rest of the tweets with a length greater than 140 are not as extreme:

In [13]:
# 13. Create a histogram showing the distribution of tweet length for only those tweets longer than 140 characters


As you can see, the histogram is extremely skewed, and that formatting error alone accounts for the long tail.

**Q)** Why do you think that there are still tweets longer than 140 characters? Do you think they are valid?

_Tip_: Print some of those tweets and see what's there

In [14]:
# 14. Since we have a lot of tweets, and for the sake of simplicity, let's remove all tweets from our dataset with 
# length greater than 200
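A boolean mask is probably the simplest way to drop the long tweets. The sketch below uses hypothetical toy data and assumes the `text_length_chars` column from question 8.

```python
import pandas as pd

# Hypothetical toy data with lengths 10, 200, 201 and 500.
df = pd.DataFrame({"text": ["a" * 10, "b" * 200, "c" * 201, "d" * 500]})
df["text_length_chars"] = df["text"].str.len()

# Keep tweets of length 200 or less; reset_index renumbers the surviving rows.
filtered = df[df["text_length_chars"] <= 200].reset_index(drop=True)

print(len(df), "->", len(filtered))
```

Assigning the result to a new name, rather than overwriting `df`, keeps the unfiltered data around in case you want to compare the two.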


In [15]:
# 15. Plot again a histogram of the tweet length distribution, how does it look now?


In [17]:
# 16. Do you observe a particular shape? Anything surprising?


Now, let's do the same analysis for length _in words_.

In [16]:
# 17. Create a column `text_n_words`. Consider that words are separated by spaces in the text.


In [17]:
# 18. What are the minimum, maximum and average number of words in a tweet in this dataset?
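Questions 17 and 18 could be sketched as follows, assuming pandas and hypothetical toy data. One design choice to be aware of: the sketch splits on any run of whitespace, which is slightly more lenient than splitting on a single space, and the comparison column shows where the two differ.

```python
import pandas as pd

# Hypothetical toy data; the second tweet has a double space on purpose.
df = pd.DataFrame({"text": ["love it", "so  tired today", "ok"]})

# split() with no argument splits on runs of whitespace,
# so repeated spaces do not create empty "words".
df["text_n_words"] = df["text"].str.split().str.len()

# For comparison: splitting on exactly one space counts the empty
# string between two consecutive spaces as a word.
naive_counts = df["text"].str.split(" ").str.len()

summary = df["text_n_words"].agg(["min", "max", "mean"])
print(df)
print(summary)
```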


In [18]:
# 19. Plot a histogram of the length in words


In [21]:
# 20. Let's now see if these distributions are similar for positive and negative tweets. Make two histograms for the
# lengths (characters and words), distinguishing between positive and negative tweets
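One way to compare the classes is to overlay a semi-transparent histogram per class on the same axes. The sketch below assumes matplotlib and pandas, uses hypothetical toy data, and normalises each histogram with `density=True` so that classes of different sizes remain comparable; none of these choices is prescribed by the exercise.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; not needed inside a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data standing in for the real training set.
df = pd.DataFrame({
    "sentiment": [1, 1, 1, 0, 0, 0],
    "text": [
        "love it", "what a great day", "so happy right now",
        "ugh", "worst monday ever, really", "so tired",
    ],
})
df["text_length_chars"] = df["text"].str.len()
df["text_n_words"] = df["text"].str.split().str.len()

# One panel per length measure, one histogram per class in each panel.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, column in zip(axes, ["text_length_chars", "text_n_words"]):
    for label, name in [(0, "negative"), (1, "positive")]:
        values = df.loc[df["sentiment"] == label, column]
        ax.hist(values, bins=10, alpha=0.5, density=True, label=name)
    ax.set_xlabel(column)
    ax.set_ylabel("density")
    ax.legend()
fig.tight_layout()
```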


In [24]:
# 21. Are the distributions different? What does this mean?

In [25]:
# 22. Let's now move on to the content. Answer the following questions, considering the whole dataset:

# How many different words are there in the dataset?
# What are the top 100 words?
# Plot the distribution of word counts (frequencies) for the top 100 words
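Counting words across the whole dataset could be done with `collections.Counter` from the standard library. This is a sketch on hypothetical toy data; it splits on whitespace and does no lowercasing or punctuation stripping, so "Love" and "love" would count as different words.

```python
from collections import Counter

import pandas as pd

# Hypothetical toy data standing in for the real training set.
df = pd.DataFrame({"text": ["I love this", "I hate this", "love love love"]})

# Count every whitespace-separated token over all tweets.
word_counts = Counter(word for text in df["text"] for word in text.split())

n_distinct = len(word_counts)            # number of different words
top_words = word_counts.most_common(100) # (word, count) pairs, most frequent first

# A Series indexed by word is convenient for plotting,
# for example with top_series.plot(kind="bar").
top_series = pd.Series(dict(top_words))

print(n_distinct)
print(top_series)
```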

In [29]:
# 23. What can you say about the top 100 words and their frequency?


## What now?

You've gotten to know your data. By now you know:

* How big your dataset is
* How your positives/negatives are distributed
* How long the tweets are, and what the most common tweet lengths are
* What the most common words in the tweets are
* How the frequencies of words are distributed

In the next part of the project, we will learn how to pre-process data for text analysis.