
# Chinatown and the Culture of Exclusion

**Developers: Michaela Palmer, Maya Shen, Cynthia Leu, Chris Cheung**

We will be reading and analyzing representations of Chinatown in the form of data and maps. In addition, we will learn how data tools can be used to read and analyze large volumes of text.

**Explore more of this lab here: https://github.com/ds-modules/XENGLIS-31AC**

## What this lab will cover
* Running Jupyter Notebooks
* Data Analysis of Chinatowns' demographics
* Visualization & Interpretation
* Using Data Tools to Analyze Primary Texts

## What you need to do
* Read the content, complete the questions
* Analyze the data
* Submit the assignment



### Official map of Chinatown in San Francisco - 1885
#### Prepared under the supervision of the special committee of the Board of Supervisors. July 1885.

![image](data/PJM_1093_01.jpg)


This map reflects the pervasive bias against the Chinese in California and in turn further fostered the hysteria. It was published as part of an official report of a Special Committee established by the San Francisco Board of Supervisors "on the Condition of the Chinese Quarter." The Report resulted from a dramatic increase in hostility to the Chinese, particularly because many Chinese laborers had been driven out of other Western states by vigilantes and sought safety in San Francisco (Shah 2001, 37).
<p>The substance and tone of the Report is best illustrated by a few excerpts: "The general aspect of the streets and habitations was filthy in the extreme, . . . a slumbering pest, likely at any time to generate and spread disease, . . . a constant source of danger . . . , the filthiest spot inhabited by men, women and children on the American continent." (Report 4-5). "The Chinese brought here with them and have successfully maintained and perpetuated the grossest habits of bestiality practiced by the human race." (Ibid. 38).
<p>The map highlights the Committee's points, particularly the pervasiveness of gambling, prostitution and opium use. It shows the occupancy of the street floor of every building in Chinatown, color coded to show: General Chinese Occupancy; Chinese Gambling Houses; Chinese Prostitution; Chinese Opium Resorts; Chinese Joss Houses; and White Prostitution.
The Report concludes with a recommendation that the Chinese be driven out of the City by stern enforcement of the law: "compulsory obedience to our laws [is] necessarily obnoxious and revolting to the Chinese; and the more rigidly this enforcement is insisted upon and carried out the less endurable will existence be to them here, the less attractive will life be to them in California. Fewer will come and fewer will remain. . . . Scatter them by such a policy as this to other States . . . ." (Ibid. 67-68)

# Importing Modules

Just run this (may take up to a minute).  
*If you're running this locally, you'll need to install the datascience module for this to work! (You should be fine if you're on DataHub).*

In [8]:
%%capture
!python -m spacy download en
!pip install --no-cache-dir wordcloud
!pip3 install --no-cache-dir -U folium
!pip install --no-cache-dir textblob
# Data tables and numerical tools (the star import also provides make_array)
from datascience import *
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from ipywidgets import *
%matplotlib inline
# Mapping
import folium
import geojson
# Text analysis
import spacy
from wordcloud import WordCloud
from textblob import TextBlob
# Notebook display helpers (imported after ipywidgets so these names take precedence)
from IPython.display import HTML, display, IFrame

ModuleNotFoundError: No module named 'spacy'

# Reading Primary Texts


In this class, we have been learning how to 'close-read' primary texts. Close-reading generally involves picking select passages and reading for the latent meanings embedded in word choice, syntax, the use of metaphors and symbols, etc. Here, we are introducing another way of analyzing primary texts using computational methods. Computational text analysis generally involves 'counting' words. Let's see how this works by analyzing some of the poems written by Chinese immigrants on Angel Island.

<p>Run the following cell to import the poems from a .txt file.

In [6]:
with open('data/islandpoetry1_22.txt', "r") as f:
    raw = f.read()
print(raw)

The sea-scape resembles lichen twisting and
turning for a thousand li
There is no shore to land and it is difficult to
walk.
With a gentle breeze I arrived at the city
thinking all would be so.
At ease, how was one to know he was to live in a
wooden building?

Because my house had bare walls, I began
rushing all about.
The waves are happy, laughing "Ha-ha!"
When I arrived on Island, I heard I was
forbidden to land.
I could do nothing but frown and feel angry
at heaven.

As a rule, a person is twenty before he starts
making a living.
Family circumstances have forced me to
experience wind and dust.
The heartless months and years seem bent on
defeating me.
It is a pity that time quickly ages one.

The gold and silver of America is very
appealing.
Jabbing an awl into the thigh in search of
glory,
I embarked on the journey.
Not only are my one-thousand pieces of gold
already depleted, but
My countenance is blackened. It is surely for
the sake of the family.

Four days before the Qiqiao Fest

We're interested in which words appear most often in our set of poems. It's pretty hard to see much when the text is in this form. We'll come back to the question of which words are most common, with actual numbers, a bit later. For now, run the following cell to generate two interesting visualizations of the most common words (minus those such as "the", "a", etc.).

In [7]:
wordcloud = WordCloud().generate(raw)

plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")

# lower max_font_size
wordcloud = WordCloud(max_font_size=40).generate(raw)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

NameError: name 'WordCloud' is not defined

<div class="alert alert-info">
Question: What are the most common words you notice? Judging from these words, what do you think these poems are about?
</div>

Oops, it seems we've forgotten just how many poems we have in our set. Luckily we have a quick way of finding out! Each "\n" in the display of the poem text indicates a line break. It turns out that each poem is separated by an empty line, a.k.a. two line breaks or "\n"'s.

In [None]:
num_poems = len(raw.split("\n\n"))
num_poems
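
To see why splitting on "\n\n" counts poems, here is a minimal sketch on a made-up string. The `sample` text and the variable names are placeholders written for illustration; they do not come from the poems file.

```python
# Hypothetical sample text: three short "poems" separated by blank lines.
sample = "first line\nsecond line\n\nonly line\n\nline a\nline b\nline c"

# Splitting on two consecutive line breaks yields one string per poem.
sample_poems = sample.split("\n\n")
sample_num_poems = len(sample_poems)

# Splitting each poem on a single line break yields its lines.
lines_per_poem = [len(p.split("\n")) for p in sample_poems]

print(sample_num_poems)
print(lines_per_poem)
```

If a file ends with a blank line, the split produces an extra empty string at the end, so it can be worth checking the last entry before trusting the count.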

We can also use this idea to calculate the number of characters in each poem.

In [None]:
num_char_per_poem = [len(p) for p in raw.split("\n\n")]
print(num_char_per_poem)

This is interesting but seems like just a long list of numbers. What about the average number of characters per poem?

In [None]:
np.mean(num_char_per_poem)

Let's look at it in histogram form to get a better idea of our data.

In [None]:
Table().with_columns("Character Count", np.asarray(num_char_per_poem)).hist()

Each bar of this histogram tells us what proportion of the poems (the height of the bar) have that many characters (the position of the bar on the x-axis).
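
A histogram groups values into bins and reports how many fall into each one. The sketch below does that by hand for a made-up list of character counts. The numbers and the bin width of 50 are assumptions chosen for illustration, and the binning that `hist()` picks for the real data will likely differ.

```python
from collections import Counter

# Hypothetical character counts for six poems (illustrative values only).
char_counts = [120, 135, 180, 210, 145, 260]
bin_width = 50  # assumed bin width for this sketch

# Map each count to the lower edge of its bin, e.g. 135 -> 100.
bin_edges = [(c // bin_width) * bin_width for c in char_counts]
bin_totals = Counter(bin_edges)

# Convert raw counts to proportions of all poems.
bin_proportions = {
    edge: n / len(char_counts) for edge, n in sorted(bin_totals.items())
}

for edge, prop in bin_proportions.items():
    print(f"{edge}-{edge + bin_width - 1}: {prop:.2f}")
```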

We can also use "\n" to look at [enjambment](https://en.wikipedia.org/wiki/Enjambment) too. Let's calculate the proportion of lines that are enjambed out of the total number of lines per poem. 

In [None]:
from string import punctuation

poems = raw.split("\n\n")

all_poems_enjambment = []
for p in poems:
    lines = p.split("\n")
    # A line counts as enjambed when it is non-empty and does not end in punctuation.
    enjambment = sum(1 for line in lines if line and line[-1] not in punctuation)
    # Proportion of enjambed lines out of all lines in this poem.
    all_poems_enjambment.append(enjambment / len(lines))

print(all_poems_enjambment)
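
To make the calculation concrete, the sketch below applies the same rule to the first poem printed earlier in this lab. The helper name `enjambment_ratio` is hypothetical and exists only for this example.

```python
from string import punctuation

def enjambment_ratio(poem):
    """Share of lines that are non-empty and do not end in punctuation."""
    lines = poem.split("\n")
    enjambed = [line for line in lines if line and line[-1] not in punctuation]
    return len(enjambed) / len(lines)

# The first poem from the file, with its line breaks preserved.
first_poem = (
    "The sea-scape resembles lichen twisting and\n"
    "turning for a thousand li\n"
    "There is no shore to land and it is difficult to\n"
    "walk.\n"
    "With a gentle breeze I arrived at the city\n"
    "thinking all would be so.\n"
    "At ease, how was one to know he was to live in a\n"
    "wooden building?"
)

ratio = enjambment_ratio(first_poem)
print(ratio)
```

Because this measure only checks the last character of each line, it cannot tell a deliberate enjambment from a line that was simply wrapped for layout.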

Once again, what about the average?

In [None]:
np.mean(all_poems_enjambment)

Let's now return to the question of the words that appear the most frequently in these 49 poems. First we have to use spaCy, an open-source software library for Natural Language Processing (NLP), to parse through the text and replace all the "\n"'s with spaces.

In [None]:
nlp = spacy.load('en', parser=False)
parsed_text = nlp(raw.replace("\n", " "))

We can separate all the words/symbols and put them in a table.

In [None]:
toks_tab = Table()
toks_tab.append_column(label="Word", values=[word.text for word in parsed_text])
toks_tab.show()

In [None]:
toks_tab.append_column(label="POS", values=[word.pos_ for word in parsed_text])
toks_tab.show()

Now let's create a new table with even more columns using the "tablefy" function below.

In [None]:
def tablefy(parsed_text):
    toks_tab = Table()
    toks_tab.append_column(label="Word", values=[word.text for word in parsed_text])
    toks_tab.append_column(label="POS", values=[word.pos_ for word in parsed_text])
    toks_tab.append_column(label="Lemma", values=[word.lemma_ for word in parsed_text])
    toks_tab.append_column(label="Stop Word", values=[word.is_stop for word in parsed_text])
    toks_tab.append_column(label="Punctuation", values=[word.is_punct for word in parsed_text])
    toks_tab.append_column(label="Space", values=[word.is_space for word in parsed_text])
    toks_tab.append_column(label="Number", values=[word.like_num for word in parsed_text])
    toks_tab.append_column(label="OOV", values=[word.is_oov for word in parsed_text])
    toks_tab.append_column(label="Dependency", values=[word.dep_ for word in parsed_text])
    return toks_tab

In [None]:
tablefy(parsed_text).show()

Next, let's look at the frequency of words. However, we want to get rid of stop words (such as "the" and "and"), punctuation, and spaces. We can do this by selecting rows that are not stop words, punctuation, or spaces, then grouping by word and sorting by count!

In [None]:
word_counts = tablefy(parsed_text).where("Stop Word", are.equal_to(False)).where(
    "Punctuation", are.equal_to(False)).where(
    "Space", are.equal_to(False)).group("Word").sort("count",descending=True)
word_counts
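
The same filter-then-count idea can be sketched with the Python standard library alone. This is a rough stand-in, not the spaCy pipeline used above: the stop-word set is a hand-picked assumption far smaller than spaCy's, and the names `toy_stop_words` and `toy_counts` are hypothetical.

```python
import re
from collections import Counter

# Two sentences taken from the poems printed earlier in this lab.
text = (
    "The gold and silver of America is very appealing. "
    "Not only are my one-thousand pieces of gold already depleted, "
    "but My countenance is blackened."
)

# Hand-picked stop words for this sketch (an assumption; spaCy's list is far longer).
toy_stop_words = {
    "the", "and", "of", "is", "very", "not",
    "only", "are", "my", "but", "already", "one",
}

# Lowercase the text and keep runs of letters, which also drops punctuation and spaces.
tokens = re.findall(r"[a-z]+", text.lower())

# Count every token that is not in the stop-word set.
toy_counts = Counter(t for t in tokens if t not in toy_stop_words)

print(toy_counts.most_common(3))
```

Unlike the table above, this sketch lowercases every word before counting, so "My" and "my" are treated as the same word.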

In this table, we have both the words "sad" and "sadness" - it seems strange to separate them. It turns out that these words are part of the same "lexeme", or a unit of meaning. For example, "run", "runs", "ran", and "running" are all part of the same lexeme with the lemma 'run'. Lemmas are another column in our table from above! Nice!

In [None]:
lemma_counts = tablefy(parsed_text).where("Stop Word", are.equal_to(False)).where(
    "Punctuation", are.equal_to(False)).where(
    "Space", are.equal_to(False)).group("Lemma").sort("count",descending=True)
lemma_counts
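
To see what grouping by lemma does, the sketch below merges the "run" forms mentioned above using a hand-written lookup table. The table `toy_lemmas` is an assumption made for illustration; in the cell above, spaCy supplies the lemma for each word through `word.lemma_`.

```python
from collections import Counter

# Hand-written lemma lookup for illustration only (not spaCy's lemmatizer).
toy_lemmas = {"runs": "run", "ran": "run", "running": "run"}

words = ["run", "runs", "ran", "running", "walk"]

# Replace each word with its lemma when one is listed, then count.
lemma_totals = Counter(toy_lemmas.get(w, w) for w in words)

print(lemma_totals)
```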

Now let's look at how many words there are of each part of speech.

In [None]:
pos_counts = tablefy(parsed_text).where("Stop Word", are.equal_to(False)).where(
    "Punctuation", are.equal_to(False)).where(
    "Space", are.equal_to(False)).group("POS").sort("count",descending=True)
pos_counts

We can also look at the proportions of each POS out of all the words!

In [None]:
# The total is the same for every row, so compute it once before the loop.
total = np.sum(pos_counts.column("count"))
for i in np.arange(pos_counts.num_rows):
    pos = pos_counts.column("POS").item(i)
    count = pos_counts.column("count").item(i)
    proportion = str(count / total)
    print(pos + " proportion: " + proportion)

If we're interested in words' relations with each other, we can look at words that are next to each other. The function below returns the word following the first instance of the word you search for in the specified source.

In [None]:
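def nextword(word, source):
    """Return the word right after the first occurrence of `word`, or None if there is none."""
    for i, w in enumerate(source):
        # Skip a match on the final word, which has nothing after it.
        if w == word and i + 1 < len(source):
            return source[i+1]
    return None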

Mess around a bit with this function! Change the "word" argument.

In [None]:
split_txt = raw.split()

# Change the target or "home" to other words!
nextword("home", split_txt)

We are specifically interested in the word "I" and the words that poets use right after it. Let's make an array of all the words that follow "I" in these poems. For easier viewing, each phrase is printed out. What do you notice?

In [None]:
one_after_i = make_array()
for i, w in enumerate(split_txt):
    if w == "I":
        one_after_i = np.append(one_after_i, split_txt[i+1])
for i in one_after_i:
    print("I " + i)

Above we have only shown the next word. What about the next two words? Does this give you any new insight?

In [None]:
two_after_i = make_array()
for i, w in enumerate(split_txt):
    if w == "I":
        two_after_i = np.append(two_after_i, split_txt[i+1] + " " + split_txt[i+2])
for i in two_after_i:
    print("I " + i)
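
The two cells above differ only in how many words they collect after "I". One way to generalize them is a small helper that takes the number of following words as a parameter. The name `words_after` and its signature are hypothetical; the sample sentences come from the poems printed earlier.

```python
def words_after(target, tokens, n=1):
    """Collect the n words following every occurrence of `target`."""
    phrases = []
    for i, token in enumerate(tokens):
        following = tokens[i + 1 : i + 1 + n]
        # Keep only matches that have a full n words after them.
        if token == target and len(following) == n:
            phrases.append(" ".join(following))
    return phrases

sample_tokens = (
    "When I arrived on Island, I heard I was forbidden to land. "
    "I could do nothing but frown"
).split()

one_after = words_after("I", sample_tokens, 1)
two_after = words_after("I", sample_tokens, 2)

for phrase in two_after:
    print("I " + phrase)
```

Slicing instead of indexing means an "I" near the very end of the text is skipped rather than raising an error.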

Try doing some exploring of your own! If you're feeling stuck, feel free to copy and edit code from above.

In [None]:
# Write your own code here!

### Sentiment Analysis

We can analyze the overall sentiment, or emotion conveyed, in each of the poems using the code below. Here, we analyze the sentiment of each poem individually. Once you run the next cell, you'll see the sentiment value for each poem. TextBlob's polarity score ranges from -1 to 1: a value below 0 denotes negative sentiment, and a value above 0 denotes positive sentiment.

In [None]:
sentiments = make_array()
for p in poems:
    poem = TextBlob(p)
    sentiments = np.append(sentiments, poem.sentiment.polarity)
sentiments

Now, what does this mean? It appears that the number of poems with negative sentiment is about the same as the number of poems with positive or neutral (0) sentiment. We can look at the proportion of negative poems in the next cell:

In [None]:
neg_proportion = np.count_nonzero(sentiments < 0)/len(sentiments)
neg_proportion

Okay, so just under half of the poems have negative sentiment. So, on average the poems have slightly positive sentiment, right?

We can also perform sentiment analysis across the text of all of the poems at once and see what happens:

In [None]:
poems_all = TextBlob(raw.replace('\n', ' '))
poems_all.sentiment.polarity

This way of analyzing the text tells us that the language in all of the poems has slightly negative sentiment.

One more analysis we can perform is computing the average sentiment of the poems, given the list of each individual poem's sentiments that we computed earlier:

In [None]:
np.mean(sentiments)

This method also tells us that our poems have slightly negative sentiment, on average.
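
How can fewer than half of the poems be negative while the average is still negative? The mean depends on the size of each score, not just its sign. The sketch below shows this with made-up polarity values; they are assumptions for illustration, not output from TextBlob.

```python
from statistics import mean

# Hypothetical polarity scores for five poems (not output from TextBlob).
toy_sentiments = [-0.6, -0.5, 0.1, 0.05, 0.0]

# Share of poems whose score is below zero.
negative_share = sum(s < 0 for s in toy_sentiments) / len(toy_sentiments)

# Average score across all poems.
average = mean(toy_sentiments)

print(negative_share)
print(average)
```

Here only two of the five scores are negative, but they are large enough in magnitude to pull the average below zero.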

Here, let's look at one of the poems with its sentiment value:

In [None]:
poem_3 = poems[3].replace('\n', ' ')
print(poem_3)
print(TextBlob(poem_3).sentiment.polarity)

Let's look at one more poem:

In [None]:
poem_47 = poems[47].replace('\n', ' ')
print(poem_47)
print(TextBlob(poem_47).sentiment.polarity)

<div class="alert alert-info">
Question: Do you think the sentiment analyzer did a good job assigning the sentiment to these poems? What might that mean for the trends we see in our average sentiment across the poems?
</div>

---

***Please fill out our [feedback form](https://goo.gl/forms/Ir0Ulg5WDQogHhK72)!***