# Social Media Mining: Saving Data
### Vincent Malic - Spring 2018

## Part V. Saving Data as File with `Pickle` library
There are many ways to save the data you've collected as a file on your computer. When working on your final projects, you'll probably want to compartmentalize the code you use to *get* the data, the code you use to *process* it, and the code you use to *analyze* it. To accomplish this, you'll need a way to save your data *as files* so you have a persistent form of the data that you can pass from one piece of code to another (or from one person to another, if you're working in a group).

In this tutorial, we'll cover one of the easier ways to save and load data, using a module called ``pickle``. Later on in the course, we'll learn how to use the data-structure library called ``pandas``, which will provide us with more options.

## First, Get some Data. 
* Assign variables for the API key and secret
* Import Tweepy, authenticate with the key and secret, and create the `api` object

In [1]:
API_KEY = ""
API_SECRET = ""

In [2]:
import tweepy
auth = tweepy.AppAuthHandler(API_KEY, API_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

### In homage to the first study we discussed in class
* Pull 100 tweets each from campaign accounts of Clinton and Trump. 
* Initialize empty lists
* Iterate over a `tweepy.Cursor` object built on the `user_timeline` method, indicating the user id
* From each status, select `text`, `favorite_count`, `retweet_count`, and `source`

In [3]:
clinton_tweets = []
trump_tweets = []

for status in tweepy.Cursor(api.user_timeline, id="HillaryClinton").items(100):
    clinton_tweets.append((status.text, status.favorite_count, status.retweet_count, status.source))
    
for status in tweepy.Cursor(api.user_timeline, id="realDonaldTrump").items(100):
    trump_tweets.append((status.text, status.favorite_count, status.retweet_count, status.source))

In [5]:
print(clinton_tweets[1])
print("*"*50)
print(trump_tweets[1])

('Tune in today at 2:30pm ET/11:30am PT! https://t.co/rnrRfjlsAI', 9299, 1512, 'Twitter for iPhone')
**************************************************
('Thank you to Sue Kruczek, who lost her wonderful and talented son Nick to the Opioid scourge, for your kind words w… https://t.co/0kIdepXBdi', 62961, 12798, 'Twitter for iPhone')
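
Each element of these lists is a plain tuple of four values in the order `(text, favorite_count, retweet_count, source)`. A minimal sketch of how such tuples can be unpacked, using made-up records (`sample_tweets` and its contents are hypothetical stand-ins, not collected data):

```python
# Hypothetical stand-in records with the same shape as the collected tuples:
# (text, favorite_count, retweet_count, source)
sample_tweets = [
    ("An example tweet", 10, 2, "Twitter for iPhone"),
    ("Another example tweet", 25, 5, "Twitter Web Client"),
]

# Unpack each tuple into named variables
for text, favorites, retweets, source in sample_tweets:
    print(source, favorites, retweets, text)

# Pull out a single feature (the favorite count) across all tweets
favorite_counts = [tweet[1] for tweet in sample_tweets]
print(favorite_counts)
```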


### Now we've got data. 
* 100 tweets each from Clinton and Trump. For each tweet we have 4 features: text, favorite count, retweet count, and source.
* Now we want to save the data and be able to load it into whatever other programs we write to process and analyze the text, without having to run this data-collection code again.
* How do we do that?

## Use the ``pickle`` module 
* Save data in a binary file on our computer. 
* Import module with alias ``pkl``  

In [6]:
import pickle as pkl

## Save your data as a file:
* `pkl.dump()` takes two arguments; the first argument is the variable you want to save
* The second argument is the file object returned by the function open(), which itself takes two arguments
* Pass open() the file name you want to save to, ending in `.pkl`, and the mode ``wb``, which means WRITE, BINARY.
* Save the Clinton tweets to clinton.pkl and the Trump tweets to trump.pkl.
```
pkl.dump([the variable you want to save], open([the name of the file to save to], "wb"))
```

In [7]:
pkl.dump(clinton_tweets, open("clinton.pkl", "wb"))
pkl.dump(trump_tweets, open("trump.pkl", "wb"))
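
The one-line form above leaves it to Python to close the file at some later point. A common alternative, sketched here with a hypothetical stand-in list and a temporary directory rather than the real tweets, is to open the file in a `with` block so it is closed as soon as the data has been written:

```python
import os
import pickle as pkl
import tempfile

# Hypothetical stand-in for the collected tweets
example_tweets = [("An example tweet", 10, 2, "Twitter for iPhone")]

# Write into a temporary directory so the sketch does not touch real files
example_path = os.path.join(tempfile.mkdtemp(), "example.pkl")

# "wb" = write, binary; the file is closed automatically when the block ends
with open(example_path, "wb") as f:
    pkl.dump(example_tweets, f)

# The file now exists on disk and is no longer open
print(os.path.exists(example_path), f.closed)
```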

Now I'm going to use some Python code to delete the ``clinton_tweets`` and ``trump_tweets`` variables from Python.

In [8]:
del clinton_tweets
del trump_tweets

The data we collected is gone, vanished.

In [9]:
clinton_tweets

NameError: name 'clinton_tweets' is not defined

## Retrieve the data from file
* All we have to do is load it. 
* `pkl.load(open([the name of the file to load], "rb"))`

### Look at the second argument of the open() function
* Use the pkl `load` function with the `open()` function
* Arguments: indicate the file name to read; ``rb`` means READ, BINARY
* CAUTION: do not mix up the modes by using ``rb`` when WRITING a pickle or ``wb`` when READING a pickle. Opening an existing file with ``wb`` erases its contents.

##### Avoid at all costs!!!
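
To see why the modes matter, here is a small sketch using a hypothetical throwaway file (`demo_path` is an illustrative name, not one of the tutorial's files). Opening an existing pickle with ``wb`` empties it before anything can be read, and loading the emptied file then fails:

```python
import os
import pickle as pkl
import tempfile

# Throwaway file in a temporary directory (hypothetical, for illustration only)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.pkl")

# Save some data correctly with "wb"
with open(demo_path, "wb") as f:
    pkl.dump(["some", "data"], f)
size_before = os.path.getsize(demo_path)

# The mistake: "wb" where "rb" was intended.
# Opening in write mode truncates the file immediately.
with open(demo_path, "wb") as f:
    pass
size_after = os.path.getsize(demo_path)

# Loading the now-empty file raises EOFError
try:
    with open(demo_path, "rb") as f:
        pkl.load(f)
    load_failed = False
except EOFError:
    load_failed = True

print(size_before > 0, size_after, load_failed)
```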

In [10]:
new_clinton = pkl.load(open("clinton.pkl", "rb"))
new_trump = pkl.load(open("trump.pkl", "rb"))

### Our data is back.

In [12]:
print(new_clinton[1])
print("*"*50)
print(new_trump[1])

('Tune in today at 2:30pm ET/11:30am PT! https://t.co/rnrRfjlsAI', 9299, 1512, 'Twitter for iPhone')
**************************************************
('Thank you to Sue Kruczek, who lost her wonderful and talented son Nick to the Opioid scourge, for your kind words w… https://t.co/0kIdepXBdi', 62961, 12798, 'Twitter for iPhone')
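
If you want to hand both data sets to a teammate as a single file, one option is to put the lists in a dictionary and pickle that instead. This is only a sketch with hypothetical stand-in lists and a temporary file, not the tutorial's own files:

```python
import os
import pickle as pkl
import tempfile

# Hypothetical stand-ins for the two lists of tweet tuples
bundle = {
    "clinton": [("An example tweet", 10, 2, "Twitter for iPhone")],
    "trump": [("Another example tweet", 25, 5, "Twitter for Android")],
}

# Temporary location so the sketch does not overwrite real files
bundle_path = os.path.join(tempfile.mkdtemp(), "tweets.pkl")

# Save the whole dictionary in one file
with open(bundle_path, "wb") as f:
    pkl.dump(bundle, f)

# Load it back and check that nothing changed in the round trip
with open(bundle_path, "rb") as f:
    restored = pkl.load(f)

print(restored == bundle)
print(sorted(restored.keys()))
```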
