# Unit 5: Text Analysis

## Contents

### Lab Questions

## Getting Started

A company is often interested in determining the public's opinion of the company's products and services.  Traditionally, this data might have been obtained through the use of customer satisfaction surveys or broader market research efforts.  With the growth in use of online shopping there are large collections of user reviews attesting to the quality of a company's offerings.  Similarly, the popular social media platforms allow users to easily share opinions in a public forum.  Collecting and analyzing this data can provide useful information that could be used to guide a company's decision-making process.

In previous lessons, we worked primarily with *structured data* - data with a well-defined organizational structure that adhered to certain properties which we could easily infer through exploration.  For example, the when we worked with county auditor data, each dataset was tabular with rows that represented an individual parcel and columns represented different properties of those parcels.  Each column had a specific data type and it was unlikely that a column value for a specific row would violate that type property. Structured data could be stored in a relational database (making use of the structured imposed by such a data store) and easily searched.

Text in the form of user reviews or social media posts are often classified as *unstructured data* - data lacking a pre-defined organizational model where the characteristics of one record could differ significantly from another.  For example, in a collection of user reviews, one record might be "This is a great product. I've owned it for five years and it always works perfectly!".  Characteristics of this data could be that it is 84 characters long, written in English, and uses only [ASCII](https://en.wikipedia.org/wiki/ASCII) characters and we might write code that relies on those assumptions.  However, among the many reviews might be this review, "Buen equipo, robusto, rápido, ... Perfecto para el uso que se le va a dar. Tiene de todo. Diseño moderno." This review is clearly not written in English and uses non-ASCII characters so any analysis that assumed these features might fail. We'll discuss collecting, processing, and analyzing unstructured data, focusing on text data.

As computing power and algorithms improve, it has become increasingly easy to extract information from unstructured data.  With this ability, it is important to be aware that there are potential ethical and legal issues.  As users freely post their thought on social media, is it ethically right to mine this data in attempt to increase profit or for other gain? While an exploration of legal issues is outside the scope of the material covered here, it is important that one is aware that such issues could exist. 

To work with text data, we'll make use of the following libraries.

- [beautifulsoup](https://www.crummy.com/software/BeautifulSoup/) - a web scraping library
- [NLTK](https://www.nltk.org/) - the Natural Language Toolkit, a library for processing language data
- [requests](http://docs.python-requests.org/en/master/) - library for making HTTP requests and processing the results
- [requests-oauthlib](https://github.com/requests/requests-oauthlib) - [OAuth](https://en.wikipedia.org/wiki/OAuth) support for the requests library 
- [tinydb](https://tinydb.readthedocs.io/en/latest/) - a document data store implemented in Python


We'll use `pip` to install these libraries. We've worked with some of these before but we can ensure they are installed using `pip`.

In [None]:
!pip install beautifulsoup4 nltk requests requests_oauthlib tinydb

## Collection

### Web Scraping

A lot of human-generated text data in the form of reviews or social media posts are accessible on a web site. One way of accessing this data is through a process knowing as [*web scraping*](https://en.wikipedia.org/wiki/Web_scraping).  Web pages are creating using [HTML](https://en.wikipedia.org/wiki/HTML), a markup language that provides a way of indicating structure for the accompanying content.  We can make use of this structure as a way of identifying the location of text on a website that is of interest to us.  Modern websites make heavy use of JavaScript and CSS for interactive functionality and to tailor their appearance; while these are nice when we, as humans, are browsing a site, they don't usually impart structural changes to the underlying data so we'll only focus on HTML content for scraping.  

Let's look at a relatively simple example.  Consider this web page that was generated using Amazon product review data available on [Kaggle](https://www.kaggle.com/datafiniti/consumer-reviews-of-amazon-products/data).  This page is stored locally with our other data; to load data from a web server, we could use a library like [requests](http://docs.python-requests.org/en/master/).

In [10]:
from IPython.display import HTML
HTML(filename="./data/05-reviews.html")

We have hundreds of reviews for a specific product where each review includes a title, a rating and the review text.  To process this data, we first need to extract it.  Let's look at a bit of the underlying HTML.

In [27]:
with open('./data/05-reviews.html') as infile:
    reviews = infile.readlines()
    
for line in reviews[:100]:
    print(line)

<!DOCTYPE html>

<html>

<head>

    <title>Amazon Tap Reviews</title>

    <style>

        .review_container {

            padding: 1em;

        }



        .review_container:nth-child(odd) {

            background-color: rgb(145, 200, 231);

        }

        

        .title {

            font-weight: bold;

        }



        .rating {

            margin-left: 1em;

        }

        

        .review {

            display: block;

            margin: 0.5em;

        }

    </style>

</head>

<body>

    <div class="review_container">

        <span class="title">Tap Alexa on the go!</span>

        <span class="rating">5.0</span>

        <span class="review">It was just a few weeks ago that I was bemoaning the fact that I did not scoop up an Echo back when they were being

            released last year. Not only could it be had at nearly half off the current price, but it came with a voice remote

            and we could have had a year longer using this amazing dev

Examining the HTML, we see that within the [*body*](https://www.w3schools.com/tags/tag_body.asp) element, we have individual [*div*](https://www.w3schools.com/tags/tag_div.asp) elements for each review.  For each review, there are three [*span*](https://www.w3schools.com/tags/tag_span.asp) elements that correspond to the review title, the review rating, and the review text, in order.  We could write code to read the content of the file and extract this review data by examining each line as a string and looking for *div* and *span* elements and extracting the corresponding text but the code would become complicated.  We previously used *pandas* to extract data from HTML but that relied on the data being stored in an HTML [*table*](https://www.w3schools.com/tags/tag_table.asp), which is not the case with this data.  

One option is to make use of Python's standard library [XML processing modules](https://docs.python.org/3.6/library/xml.html); [XML](https://en.wikipedia.org/wiki/XML) is another markup language often used to store data. While HTML is different and [not a subset of XML](https://stackoverflow.com/q/5558502/1298998); we can often use Python's XML tools to process HTML.  We'll use the [*ElementTree*](https://docs.python.org/3.6/library/xml.etree.elementtree.html#module-xml.etree.ElementTree) module to do this.  We begin by importing the module and parsing the file.

In [35]:
import xml.etree.ElementTree as ET
tree = ET.parse('./data/05-reviews.html')

We can think of the document structure as being tree-like: at the root, we have the [*html*](https://www.w3schools.com/tags/tag_html.asp) element, which contains (or branches into) a [*head*](https://www.w3schools.com/tags/tag_head.asp) and a *body* element, which each branch further.  Frist, let's get the root element. using the *getroot()* method.

In [37]:
root = tree.getroot()
display(root)

<Element 'html' at 0x112c41ea8>

We can see that the root is, in fact, an *html* element.  To see its child elements, we can use the *getchildren()* method that returns a list of child elements.

In [38]:
root.getchildren()

[<Element 'head' at 0x112c41d68>, <Element 'body' at 0x112c41ef8>]

If we know the specific tag for a child, we can use the *find()* or *findall()* methods.  The *find()* method will return the first matching child and *findall()* will return a list of matching children.

In [43]:
body = root.find("body")
display(body)

<Element 'body' at 0x112c41ef8>

Looking at the HTML text content from the file above, we know that the *body* element contains a *div* for each review.  We can use *findall()* to access these children.

In [44]:
divs = body.findall("div")
len(divs)

540

There are 540 *div* elements in the *body* element.  This implies there are 540 reviews on the page that we can extract.  Looking again at the HTML file, we can see that within each *div* there are three *span* elements corresponding to the review title, rating, and text.  Let's look at the first *div* in `divs` and see its children.

In [45]:
div = divs[0]
div.findall("span")

[<Element 'span' at 0x112c41db8>,
 <Element 'span' at 0x112c41d18>,
 <Element 'span' at 0x112c41cc8>]

To access any text associated with an element, we can use the *text* property.  We can iterate through the `div`'s children that are *spans* and print the text for each of them.  We'll use [*enumerate*()](https://docs.python.org/3/library/functions.html#enumerate) to keep track of the index of the span whose text we're displaying.

In [47]:
for i, span in enumerate(div.findall('span')):
    print(i, span.text)

0 Tap Alexa on the go!
1 5.0
2 It was just a few weeks ago that I was bemoaning the fact that I did not scoop up an Echo back when they were being
            released last year. Not only could it be had at nearly half off the current price, but it came with a voice remote
            and we could have had a year longer using this amazing device...Well, here I am, a year late to the game and
            I can now proudly say that I own both an Echo and this new little Tap! Can I just say that I am OBSESSED with
            all things Echo and Alexa. These are devices that you may not realize how much you'll use it until you've given
            them a try.I have been spending the last few years getting our home automated. I currently use a combination
            of Smartthings, IFTTT and Logitech Harmony to control the home. Alexa fits right into the middle of all that
            and gives us the ability to control our home via voice. For me, this is the killer feature. I can integra

In HTML and XML, tags can have [*attributes*](https://www.w3schools.com/xml/xml_attributes.asp).  Here, each *div* and *span* element has a `class` attribute.  The XML parser stores a tag's attributes as a dictionary that can be accessed using the *attrib* property.  Here, we can see the attributes for the *div* we are working with. 

In [50]:
div.attrib

{'class': 'review_container'}

Though it looks like the review *span* elements appear in the same order, title-rating-text, we can use the `class` attribute to be sure as to which *span* we are working with while iterating.

In [53]:
for span in div.findall('span'):
    span_class = span.attrib.get('class')
    if span_class == 'title':
        print('TITLE: ', span.text)
    elif span_class == 'rating':
        print('RAITING: ', span.text)
    elif span_class == 'review':
        print('TEXT: ', span.text)
    else:
        print('UNKNOWN CLASS')

TITLE:  Tap Alexa on the go!
RAITING:  5.0
TEXT:  It was just a few weeks ago that I was bemoaning the fact that I did not scoop up an Echo back when they were being
            released last year. Not only could it be had at nearly half off the current price, but it came with a voice remote
            and we could have had a year longer using this amazing device...Well, here I am, a year late to the game and
            I can now proudly say that I own both an Echo and this new little Tap! Can I just say that I am OBSESSED with
            all things Echo and Alexa. These are devices that you may not realize how much you'll use it until you've given
            them a try.I have been spending the last few years getting our home automated. I currently use a combination
            of Smartthings, IFTTT and Logitech Harmony to control the home. Alexa fits right into the middle of all that
            and gives us the ability to control our home via voice. For me, this is the killer fea

We can now iterate through the HTML structure to extract the title, rating, and text for each review.  We'll look at storage options for unstructured data later but for now, we can create a pandas DataFrame. As we iterate through the HTML, we'll extract the data, create a dictionary for each review's data, and append the dictionary to the DataFrame.  We can convert the rating to a floating point value as we iterate as well. For clarity, we'll start by parsing the HTML file again.

In [63]:
import pandas as pd

reviews = pd.DataFrame(columns=['title', 'rating', 'text'])

tree = ET.parse('./data/05-reviews.html')
root = tree.getroot()
body = root.find('body')
for review_div in body.findall('div'):
    
    # skip any div that doesn't have a class attribute with "review_continer"
    if review_div.attrib.get('class') != 'review_container':
        continue
    
    review_data = {} 
    
    for span in review_div.findall('span'):
        span_class = span.attrib.get('class')
        if span_class == 'title':
            review_data['title'] = span.text
        elif span_class == 'rating':
            review_data['rating'] = float(span.text)
        elif span_class == 'review':
            review_data['text'] = span.text

    reviews = reviews.append([review_data], ignore_index=True)

reviews.head()

Unnamed: 0,title,rating,text
0,Tap Alexa on the go!,5.0,It was just a few weeks ago that I was bemoani...
1,Great for what it does,5.0,Look at this product as a portable speaker fir...
2,"Awesome, smart little portable speaker",5.0,This Amazon tap is not only a great Bluetooth ...
3,Amazing technology!,5.0,Bought this on Deal of the Day which surprised...
4,A++++++++,5.0,Amazing Sound! All around great product! Nothi...


At this point, we can calculate the descriptive statistics for the `rating` column

In [64]:
reviews.rating.describe()

count    540.000000
mean       4.531481
std        0.787736
min        1.000000
25%        4.000000
50%        5.000000
75%        5.000000
max        5.000000
Name: rating, dtype: float64

An alternative to using the standard library is to use the [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library.  Having installed it earlier, we can import it.

In [65]:
from bs4 import BeautifulSoup

To start, we use the *BeautifulSoup()* function specifying the filename for the HTML content and, optionally, the HTML parser we'd like to use.  

In [79]:
with open('./data/05-reviews.html') as infile:
    soup = BeautifulSoup(infile, 'html.parser')

While Beautiful Soup allows use to navigate the document tree as we did previously, it provide enhanced functionality that makes it much easier to find the data we want.  For our data, we know that each of the reviews is contained in a *div* with class `review_container`.  We can quickly get the collection of elements matching these criteria using Beautiful Soup's *find_all()* method.

In [80]:
divs = soup.find_all('div', {'class': 'review_container'})
len(divs)

540

We see that we have found all the review *divs*.  We can also use the *find()* method to return the first match.  Just as with the XML parser, we can access an element's text using the *text* property.

In [83]:
div = divs[0]
rating_span = div.find('span', {'class': 'rating'})
rating_span.text

'5.0'

We can rewrite our code that iterate through the file to extract review data using Beautiful Soup.  Notice how much easier the code is to read.

In [85]:
reviews = pd.DataFrame(columns=['title', 'rating', 'text'])

with open('./data/05-reviews.html') as infile:
    soup = BeautifulSoup(infile, 'html.parser')

for review_div in soup.find_all('div', {'class': 'review_container'}):
    review_data = {
        'title': review_div.find('span', {'class': 'title'}).text,
        'rating': float(review_div.find('span', {'class': 'rating'}).text),
        'text': review_div.find('span', {'class': 'review'}).text
    }
    reviews = reviews.append([review_data], ignore_index=True)

reviews.head()

Unnamed: 0,title,rating,text
0,Tap Alexa on the go!,5.0,It was just a few weeks ago that I was bemoani...
1,Great for what it does,5.0,Look at this product as a portable speaker fir...
2,"Awesome, smart little portable speaker",5.0,This Amazon tap is not only a great Bluetooth ...
3,Amazing technology!,5.0,Bought this on Deal of the Day which surprised...
4,A++++++++,5.0,Amazing Sound! All around great product! Nothi...


### Working with an API

As mentioned previously, website owners typically discourage the use of scraping for collection of data; in fact, many prohibit the practice in their terms of service.  The data is, however, often made available via an [application programming interface](https://en.wikipedia.org/wiki/Application_programming_interface) or API.  An API provides us with the means of systematically accessing data from a website or other Internet data store.  An overview of what an API is as well as a directory of available APIs is avilable at [Programmable Web].  For our work, we'll use the [Twitter API](https://developer.twitter.com/en/docs), specifically the [search] functionality.  

Twitter requires users to authenticate their API request. To do this, we can follow the [instructions](https://developer.twitter.com/en/docs/basics/getting-started#get-started-app) provided by Twitter. we must first register an application with Twitter.  We begin by going to the [Twitter Apps](https://apps.twitter.com) page and signing in.  If you don't have a Twitter account, you will have to create one and provide a mobile phone number in order to create an app. Once logged in, we can create a new app.  When creating a new app, we have to provide some details including the application name, description, and a website; we also have to accept the [developer agreement](https://dev.twitter.com/overview/terms/agreement-and-policy).  See the following image for an example of the details provided for a new application.   

<figure>
    <img src="./images/05-twitter.png" alt="twitter app details">
    <figcaption style="text-align: center; font-weight: bold">Twitter App Details</figcaption>
</figure>

Once the application is created, the application details page will be displayed.  Near the top of the page is a collection of tabs including *Keys and Access Tokens* and *Permission*.  We'll use the *Keys and Access Tokens* page in a moment but for now, let's change the access premissions for this application.  Click *Permission*, choose *Read only* and update the settings. 

Typically, to access data on a website we have to log in with a username or email address and a password to verify our identity and that we are allowed to access certain content.  In the same way, we have to prove our identity when using many APIs.  Twitter makes use of a standard known as [OAuth](https://en.wikipedia.org/wiki/OAuth) to support access to its API.  We'll need the following pieces of data from the Twitter App page in order to access the API.

- Consumer Key
- Consumer Secret
- Access Token
- Access Token Secret

The consumer key and consumer secret are on the Twitter App's *Keys and Access Tokens* page - navigate to the page.  We'll need to generate an access token and can do so from the same page.  After generating the token, we can store the necessary data in variables within the notebook. Replace the empty strings below with strings containing the appropriate data.  Be careful when sharing this data.

In [87]:
# update these values in order to use the Twitter API
CONSUMER_KEY = "8uKzs39bUdERFBm4oPIGnK7xC"
CONSUMER_SECRET = "YIq07pLR3NKaByehUoOVRiIa2aPPogk1J1WKUbCalzkYUWm5FN"
ACCESS_TOKEN = "2255201048-1r5XWubYrszR4hs9cki2ZqwudYei8QqQVgC0GYe"
ACCESS_TOKEN_SECRET = "QCETecVEcJw9j9BLhsXgitz23vGTKy2qxUR57QIr7ES8x"

With this data, we can now begin to access the API.  We'll create a session with our authentication details and with which we can make additional API requests. To do this, we use the *requests_oathlib* module which relies on *requests*.  

In [90]:
from requests_oauthlib import OAuth1Session

twitter = OAuth1Session(CONSUMER_KEY, CONSUMER_SECRET, 
                        ACCESS_TOKEN, ACCESS_TOKEN_SECRET)


We can use the `twitter` session to make requests.  Let's look at an example of the [search API](https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets.html).  To search for popular tweets containing the word "columbus" and with full text in the results, we can make a request to the following URL:

```
https://api.twitter.com/1.1/search/tweets.json?q=columbus&result_type=popular&tweet_mode=extended
```

To make the request in code, we'll use the session's [*get()*](http://docs.python-requests.org/en/master/api/#requests.get) method.

In [98]:
url = "https://api.twitter.com/1.1/search/tweets.json?q=columbus&result_type=popular&tweet_mode=extended"
response = twitter.get(url)

The Twitter API returns JSON content.  We can deserialize the content using the response's [*json()*](http://docs.python-requests.org/en/master/api/#requests.Response.json) method. 

In [99]:
results = response.json()
results

{'search_metadata': {'completed_in': 0.061,
  'count': 15,
  'max_id': 0,
  'max_id_str': '0',
  'next_results': '?max_id=982344287798018047&q=columbus&include_entities=1&result_type=popular',
  'query': 'columbus',
  'since_id': 0,
  'since_id_str': '0'},
 'statuses': [{'contributors': None,
   'coordinates': None,
   'created_at': 'Sat Apr 07 16:43:21 +0000 2018',
   'display_text_range': [0, 218],
   'entities': {'hashtags': [],
    'symbols': [],
    'urls': [],
    'user_mentions': []},
   'favorite_count': 490,
   'favorited': False,
   'full_text': 'Columbus is sitting out Panarin, Bobrovsky, Jones and Werenski tonight. New Jersey is scratching Hall, Palmieri and Zajac. I did not expect this. It sure looks like these teams are actively trying to duck the Penguins.',
   'geo': None,
   'id': 982660106843140096,
   'id_str': '982660106843140096',
   'in_reply_to_screen_name': None,
   'in_reply_to_status_id': None,
   'in_reply_to_status_id_str': None,
   'in_reply_to_user_id': No

Looking at the results, we can see that the actual tweets themselves are coupled with additional metadata.  For now, we'll work with only the tweets themselves.  Looking at the results, we can see that the outer data structure is a dictionary with a key named `statuses` and the associated value is a list where each element corresponds to a tweet.  Each tweet element is a dictionary where the full tweet is the value paired with the key `full_text`.  We can get the first tweet using the following.

In [101]:
results['statuses'][0]['full_text']

'Columbus is sitting out Panarin, Bobrovsky, Jones and Werenski tonight. New Jersey is scratching Hall, Palmieri and Zajac. I did not expect this. It sure looks like these teams are actively trying to duck the Penguins.'

Or we can extract all the tweets.

In [102]:
tweets = []
for tweet in results['statuses']:
    tweets.append(tweet['full_text'])
    
tweets

['Columbus is sitting out Panarin, Bobrovsky, Jones and Werenski tonight. New Jersey is scratching Hall, Palmieri and Zajac. I did not expect this. It sure looks like these teams are actively trying to duck the Penguins.',
 'My “thank god” comment was a sarcastic joke about being a Michigan man in Ohio. And I love Columbus &amp; always will. Never spoke bad about the club esp the fans, coaches and my brothers I played with. I just could never support an owner moving the team. One love Cbus #SaveTheCrew https://t.co/Z7LU3wFt1S',
 'Our hearts are heavy as we see the tragic news of the bus crash involving the Humboldt Broncos. The Columbus Blue Jackets join the hockey world in sending our thoughts &amp; prayers to the players, coaches, staff, families &amp; community touched by this devastating accident.',
 'Columbus, Ohio we ended this tour with destruction and now we are off to Canada for our headline tour. The tour that never ends and we couldn’t be more thankful. Also, we are taking o

## Storage

When working with unstructured data, we often have a need to aggregate and store it.  Historically, relational database management systems have been used to store structured data and SQL has been used to interface with these systems.  As the need to store unstructured data increased and traditional databases became less practical as storage solutions, [NoSQL](https://en.wikipedia.org/wiki/NoSQL) databases began to be developed.  There are a variety of types of NoSQL databases, each with its own benefits and drawbacks.  One of these types is a [document store](https://en.wikipedia.org/wiki/Document-oriented_database) where a *document* is associated with a key and stored in some sort of collection.  Here, a *document* is usualy some semi-structured data such as JSON or XML data.  Many document stores make use of the semi-structured nature of the documents to provided additional functionality such as querying for data within a document.  There are many document store implementations (and many more NoSQL database implementations in general).  As an example of working with a document store, we'll make use of [TinyDB](https://tinydb.readthedocs.io/en/latest/index.html), a document store implemented entirely in Python.  As part of this example, we'll create two stores - one for the review data and the other for the tweets we collected via the Twitter API.

Let's start by importing TinyDb and creating a data store for the twitter data.

In [109]:
import tinydb
db = tinydb.TinyDB('./data/documents.json')

TinyDB relies on an external file to persist data.  This file will itself be a JSON file. Any data we add to the TinyDB document store will be converted to JSON and stored in the file.  Like tables in a relational databases, many NoSQL have a means of partitioning data.  TinyDB has a notion of [tables](https://tinydb.readthedocs.io/en/latest/usage.html?highlight=table#tables) to do this but we must keep in mind that unlike a relational database, we cannot create relationships between tables.  Let's create two tables, one for the Twitter data and one for the review data.

In [110]:
twitter_table = db.table('twitter')
review_table = db.table('review')

Let's iterate through our review data and store each review in the document store.  To do this, we'll use the table's [*insert()*](https://tinydb.readthedocs.io/en/latest/api.html?highlight=insert#tinydb.database.Table.insert) method. To simplify the use of the database, we'll iterate through the DataFrame storing review data, convert each row into a dictionary, and store the dictionary as the document.

In [121]:
for index, row in reviews.iterrows():
    review_table.insert(row.to_dict())

We can get a count of all the documents in the table using the *len()* function.

In [122]:
len(review_table)

540

While we can access all the data in a table using the [*all()*](https://tinydb.readthedocs.io/en/latest/api.html?highlight=insert#tinydb.database.Table.insert) method, it's often more convenient to perform queries to filter the data. To create queries we use the table's [*search()*](https://tinydb.readthedocs.io/en/latest/api.html?highlight=insert#tinydb.database.Table.search) method along with TinyDB's [*Query*](https://tinydb.readthedocs.io/en/latest/api.html?highlight=insert#tinydb.queries.Query) class. First we create an instance of the *Query* class.

In [123]:
Review = tinydb.Query()

The *Query* class allows us to access key/value pairs within the document using the dot operator.  For example, we can refer to each review's "rating" key using `Review.rating`.  Here we find all reviews there the rating is 5.0.

In [124]:
five_stars = review_table.search(Review.rating == 5.0)
len(five_stars)

356

We can also search within text in a specific field in a document.

In [133]:
amazing_reviews = review_table.search(Review.text.search("amazing"))
len(amazing_reviews)

24

Let's turn to the Twitter Data. Recall that each tweet and its metadata is stored in a list of statuses in `results`, the processed API response.  We can iterate through the list and store each element as a document.

In [135]:
for tweet in results['statuses']:
    twitter_table.insert(tweet)

len(twitter_table)

While we can reuse the existing instance of *Query* to query the Twitter table, we'll create a second one.

In [137]:
Tweet = tinydb.Query()

Recall that the `full_text` key is paired with the actual message.  Let's search for tweets containing `Columbus`.

In [139]:
columbus_tweets = twitter_table.search(Tweet.full_text.search("Columbus"))
len(columbus_tweets)

15

If there were no results, try `columbus` (with a lower-case c). Each of the the tweets should include either `Columbus` or `columbus` based on how we searched for tweets using the API.

Notice that the tweet data also includes a nested user data.  Each user has a location.  Most document stores support searching nested data.  The following searches for documents where the user's location is `Columbus, OH`.  The number of results will depend on the data that was returned by the API.

In [140]:
users_in_columbus = twitter_table.search(Tweet.user.location == "Columbus, OH")
len(users_in_columbus)

2

## Analysis
