# Example Notebook

* Before running, follow the setup steps in the README. Make sure you have edited `config.txt` and moved it to `/etc/`. If you keep it somewhere else, update the path in `scraper.py` and `data_loader.py` to match.
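As a quick sanity check before running the cells below, something like the following could confirm that the config file is where the code expects it. This is only a sketch: the helper name `config_is_readable` is hypothetical and not part of this project, and `/etc/config.txt` is assumed from the setup step above.

```python
from pathlib import Path

# Assumed default location: the setup step above moves config.txt into /etc/.
DEFAULT_CONFIG_PATH = Path("/etc/config.txt")


def config_is_readable(path: Path = DEFAULT_CONFIG_PATH) -> bool:
    """Return True if the file at `path` exists and can be opened for reading.

    Hypothetical helper for illustration; it is not part of scraper.py
    or data_loader.py.
    """
    try:
        with path.open("r"):
            return True
    except OSError:
        # Covers a missing file, a directory, and permission errors.
        return False


if not config_is_readable():
    print(f"Config file not found or unreadable at {DEFAULT_CONFIG_PATH}")
```

If you chose a different location, pass that path to the helper instead of relying on the default.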

In [1]:
%load_ext autoreload

In [2]:
%autoreload 2

In [3]:
from data_loader import Data_Loader
from data_viewer import Data_Viewer
from annotator import Annotator
import time

# Scraping data

In [4]:
# A list of URLs of the submissions that you want to add to your graph.
# These should be top-level posts (not links to comments).
submissions = ['https://www.reddit.com/r/sanfrancisco/comments/7r3cy3/how_the_san_francisco_school_lottery_works_and/']
# Below is the full list of submissions I'm currently using for the school choice project
# submissions = [
#     'https://www.reddit.com/r/sanfrancisco/comments/bs5f69/just_had_the_elementary_school_lottery_explained/',
#     'https://www.reddit.com/r/sanfrancisco/comments/7r3cy3/how_the_san_francisco_school_lottery_works_and/',
#     'https://www.reddit.com/r/sanfrancisco/comments/4ah4no/fuck_the_sf_school_lottery_thats_all/',
#     'https://www.reddit.com/r/sanfrancisco/comments/b5kbse/how_the_student_assignment_system_works_sfusd/',
#     'https://www.reddit.com/r/sanfrancisco/comments/9hh9z8/two_sf_school_board_members_to_introduce/',
#     'https://www.reddit.com/r/sanfrancisco/comments/4646v8/experience_with_enrolling_in_sfusd_school/',
#     'https://www.reddit.com/r/sanfrancisco/comments/a5nrej/sf_school_board_plans_to_replace_muchcriticized/',
#     'https://www.reddit.com/r/sanfrancisco/comments/bhcxhb/san_francisco_had_an_ambitious_plan_to_tackle/',
#     'https://www.reddit.com/r/sanfrancisco/comments/5e5834/i_made_a_website_of_sf_elementary_school_test/',
#     'https://www.reddit.com/r/sanfrancisco/comments/cg5coh/sfusd_kindergarten/'
# ]
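Because the list should contain only top-level posts, it can help to check each URL before loading it. The sketch below assumes the URL shape seen in the list above (`/r/<subreddit>/comments/<id>/<slug>/`) and that a comment permalink has one extra path segment after the slug. The helper `submission_id` is hypothetical and not part of `Data_Loader`.

```python
import re
from typing import Optional

# Assumed shape of a top-level post URL:
#   https://www.reddit.com/r/<subreddit>/comments/<submission_id>/<slug>/
# A comment permalink carries one more path segment (the comment id),
# so it will not match this pattern.
SUBMISSION_URL = re.compile(
    r"^https?://(?:www\.)?reddit\.com/r/[^/]+/comments/"
    r"(?P<id>[a-z0-9]+)(?:/[^/]*)?/?$"
)


def submission_id(url: str) -> Optional[str]:
    """Return the submission id if `url` looks like a top-level post, else None."""
    match = SUBMISSION_URL.match(url)
    return match.group("id") if match else None


urls = [
    "https://www.reddit.com/r/sanfrancisco/comments/7r3cy3/"
    "how_the_san_francisco_school_lottery_works_and/",
]
for url in urls:
    if submission_id(url) is None:
        print(f"Not a top-level submission URL: {url}")
```

The extracted id is the same short identifier that appears later in the loader output and in the `view_submission` call.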

We now create `dl`, the `Data_Loader` object. The constructor opens a connection to the database and creates a `Scraper` object that connects to the Reddit API. For this to work, the Neo4j database must be running and the credentials file must be set up correctly.

In [5]:
dl = Data_Loader()

In [6]:
dl.clear_graph()

In [7]:
dl.load_submissions(submissions)

Adding submission: https://www.reddit.com/r/sanfrancisco/comments/7r3cy3/how_the_san_francisco_school_lottery_works_and/
Submission 7r3cy3 and 32 comments added in 9.3500s


# Querying and coding

To query and view our data, we use a `Data_Viewer`:

In [8]:
dv = Data_Viewer()

For example, we can query a submission based on an id:

In [9]:
print(dv.view_submission("7r3cy3", include_comments = True))

[Submission 7r3cy3]
 bloobityblurp: How the San Francisco School Lottery Works, And How It Doesn’t (https://ww2.kqed.org/news/2018/01/11/how-the-san-francisco-school-lottery-works-and-how-it-doesnt-2/) 
 

[Comment dstxpky -> Submission 7r3cy3] SFCitizenDotCom: You're not supposed to try to "game" the system, you're supposed to put schools down what are your preferences. If you have unrealistic choices, you shouldn't be "shocked" when you don't get them. 

[Comment dsu21u7 -> Comment dstxpky | Submission 7r3cy3]
nlcund: Some years ago they made it strategy-free on the recommendation of a consultant from Stanford.  The original problem was that parents could only list six schools, so they would typically list the "best" schools (highest test scores) and fail, or try to add a few "safe" schools at the bottom of the list.  There were a lot of urban legends, such as listing the same school six times.

Now parents can list all the schools in the order they want (their true ranking) without 

Then, to add codes to our data, we use the `Annotator.annotate()` method. Codes should be formatted as "code1: subcode1: subcodesubcode1; code2: subcode2; code3", and so on: semicolons separate codes, and colons separate a code from its subcodes. The annotator needs the id and type of the content node you're annotating, as well as the specific substring you'd like to code.
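To make the code-string format concrete, here is a minimal sketch of how such a string might break down into hierarchical paths. The function `parse_codes` is hypothetical and only illustrates the format. How the `Annotator` actually parses the string is an assumption here and may differ.

```python
from typing import List


def parse_codes(code_string: str) -> List[List[str]]:
    """Split a code string into a list of code paths.

    Assumed format (illustration only): ';' separates codes and
    ':' separates a code from its subcodes.
    """
    paths = []
    for entry in code_string.split(";"):
        # Each entry is one code, optionally followed by nested subcodes.
        levels = [level.strip() for level in entry.split(":")]
        levels = [level for level in levels if level]  # drop empty pieces
        if levels:
            paths.append(levels)
    return paths


print(parse_codes("code1: subcode1: subcodesubcode1; code2: subcode2; code3"))
# [['code1', 'subcode1', 'subcodesubcode1'], ['code2', 'subcode2'], ['code3']]
```

Under this reading, a string such as "algorithmic theories: popular schools harder to get; strategy" applies two codes to the same excerpt: a subcode nested under "algorithmic theories", and the top-level code "strategy".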

In [10]:
a = Annotator()

In [11]:
a.annotate(code = "strategy", 
           content_id = "dstxpky", 
           content_type = "Comment", 
           content = "You're not supposed to try to \"game\" the system")

In [12]:
a.annotate(code = "algorithmic theories: popular schools harder to get; strategy",
           content_id = "dsu21u7",
           content_type = "Comment",
           content = "The original problem was that parents could only list six schools, so they would typically list the \"best\" schools (highest test scores) and fail, or try to add a few \"safe\" schools at the bottom of the list."
)

We can now view the content that has been coded with the label "strategy". The annotated substring is highlighted with ANSI color codes, which appear as `[33m` and `[0m` around the excerpt in the plain-text output below.

In [14]:
print(dv.view_coded("strategy"))

[Content with code: strategy]
-----------------------------
[Comment dsu21u7 -> Comment dstxpky | Submission 7r3cy3]
nlcund: Some years ago they made it strategy-free on the recommendation of a consultant from Stanford.  [33mThe original problem was that parents could only list six schools, so they would typically list the "best" schools (highest test scores) and fail, or try to add a few "safe" schools at the bottom of the list.[0m  There were a lot of urban legends, such as listing the same school six times.

Now parents can list all the schools in the order they want (their true ranking) without affecting their chances either way.  It's a bit laborious though.  

[Comment dstxpky -> Submission 7r3cy3] SFCitizenDotCom: [33mYou're not supposed to try to "game" the system[0m, you're supposed to put schools down what are your preferences. If you have unrealistic choices, you shouldn't be "shocked" when you don't get them. 
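
The `[33m` and `[0m` markers in the output above are what is left of ANSI escape sequences (yellow foreground and reset) when the output is not rendered by a terminal. As a rough sketch of how this kind of substring highlighting can be done, assuming a plain first-match search (the `highlight` helper is hypothetical and is not the `Data_Viewer` implementation):

```python
YELLOW = "\x1b[33m"  # ANSI escape sequence: yellow foreground
RESET = "\x1b[0m"    # ANSI escape sequence: reset all attributes


def highlight(text: str, excerpt: str) -> str:
    """Wrap the first occurrence of `excerpt` in `text` with ANSI color codes.

    Returns `text` unchanged if the excerpt is not found.
    """
    start = text.find(excerpt)
    if start == -1:
        return text
    end = start + len(excerpt)
    return text[:start] + YELLOW + excerpt + RESET + text[end:]


comment = 'You\'re not supposed to try to "game" the system, you\'re supposed to ...'
print(highlight(comment, 'You\'re not supposed to try to "game" the system'))
```

In a terminal or notebook that interprets ANSI codes, the excerpt is shown in yellow. In a plain-text export only the bracketed codes remain.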


We can also modify a code (edit its label, or add or edit its description) or delete it.

In [15]:
a.update_code("strategy", description = "tactically reporting preferences to try to get a better outcome")

In [16]:
print(a.get_code("strategy"))

strategy - tactically reporting preferences to try to get a better outcome


In [39]:
a.delete_code("strategy")

Delete code: strategy? Type 'Y' to confirm: y
Deleting code strategy


By default, the `Data_Loader` adds a full-text Lucene index over the content, code names, and coded excerpts. This lets us search the Comment and Submission nodes efficiently.
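In Lucene's query syntax, whitespace-separated words are treated as separate terms, while a double-quoted string is matched as an exact phrase. If the search term reaches the index unchanged (an assumption; `view_search` may preprocess it), a small helper like the hypothetical `as_phrase_query` below could turn a multi-word term into a phrase query:

```python
def as_phrase_query(term: str) -> str:
    """Quote `term` so Lucene treats it as a single phrase.

    Backslashes and double quotes inside the term are escaped first,
    since both are special inside a quoted Lucene phrase.
    Hypothetical helper; not part of Data_Viewer.
    """
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


print(as_phrase_query("the lottery"))  # "the lottery"
```

Whether a phrase query or separate terms is preferable depends on how broad you want the search to be.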

In [40]:
print(dv.view_search("the lottery"))

[Content matching search term: the lottery]
-------------------------------------------
[Comment dsuioqn -> Submission 7r3cy3] yaaaaayPancakes: After reading through that, it seems like one thing that would help everyone involved is better access to school data to make choices during the lottery rounds. They ought to build a website that indexes all the school data so parents can access it easily, anytime.

[Comment dsuqip6 -> Comment dsubxm7 | Submission 7r3cy3]
ispeakdatruf: > A, B, and C were under performing and lacking in resources. The school district wanted to offer these low income students more opportunities and so the lottery system came into place. The lottery is not perfect, but I understand the purpose of it. 

Please, please tell me how the lottery helps with lack of resources.

[Comment dsu8itp -> Comment dsu54w9 | Submission 7r3cy3]
ultralame: > then fix that fucking school!

So simple! why don't you  explain how to do this?  You'd win a Nobel prize.

Not that I disagre

# Viewing and analyzing codes