# An Introduction to NoSQL with Python

## Table of Contents
+ [0. Introduction](#0)
    - [0.1 Motivation of the Tutorial](#0.1)
    - [0.2 Goal of the Tutorial](#0.2)
    - [0.3 Meaning and Features of NoSQL](#0.3)
    - [0.4 Mainstream Nosql Databases](#0.4)
+ [1. Redis](#1)
    - [1.1 Let's Connect to a Redis Database on the Cloud](#1.1)
    - [1.2 Let's write a Redis client](#1.2)
    - [1.3 Dive into Real Dataset](#1.3)
    - [1.4 Learn to Use Redis Hash](#1.4)
    - [1.5 Possible Solutions to Store Nested Data in Redis](#1.5)
+ [2. MongoDB](#2)
    - [2.1 Let's Connect to a MongoDB Database on the Cloud](#2.1)
    - [2.2 Let's Write a MongoDB Client](#2.2)
    - [2.3 Practice a Simply Query on Data that Already Exists in the Database](#2.3)
    - [2.4 Dive into Real Dataset](#2.4)
    - [2.5 Create Table and Insert Data](#2.5)
    - [2.6 Practice a Range Query on More than One Document](#2.6)
+ [3. Cassandra](#3)
    - [3.1  Let's Connect to a Standalone Cassandra in Local Machine](#3.1)
    - [3.2 Let's write a Cassndra Client](#3.2)
    - [3.3 Dive into Real Dataset](#3.3)
    - [3.4 Create Table, Insert Data and Retrieve Data](#3.4)
    - [3.5 Summary](#3.5)
+ [4. Neo4j](#4)
    - [4.1 Let's Connect to a Neo4j Database in a Local Machine](#4.1)
    - [4.2 Let's Write a Neo4j Client](#4.2)
    - [4.3 Dive into Real Dataset](#4.3)
    - [4.4 Let's Create Nodes and Relationships](#4.4)
    - [4.5 Let's Visualize on Graph](#4.5)
+ [5. Final Takeaway](#5)
+ [6. References](#6)






<a id="0"><h2> 0. Introduction</h2></a>

<a id="0.1"><h3> 0.1 Motivation of the Tutorial</h3></a>
The exponential increase in data in recent years has highlighted the challenges and limitations to store and manage data solely using traditional relational databases. Even though some RDBMS now support NoSQL features, the results are oftentimes unsatisfactory, which gives NoSQL databases a niche market to build specific solutions to address less structured data in a distributed manner efficiently.  

<a id="0.2"><h3> 0.2 Goal of the Tutorial</h3></a>
This tutorial will take you to explore some of the mainstream nosql databases on the market. I will explain the core concepts of using each database and touch upon their strengths and weaknesses. Meanwhile, I will walk you through how to use python to interact with these databases. However, I won't be naming all the query syntax as you can find that online easily. I will just be giving you one or two examples based on the twitter data scraped from HW1. After completing this tutorial, you will acquire some sense on picking the right database and have adequate knoweldge to work with nosql databases independently.

<a id="0.3"><h3> 0.3 Meaning and Features of NoSQL</h3></a>
NoSQL means Not Only SQL, implying that when designing a software solution or product, there may be more than one storage mechanism that could be used based on users' needs. There is no prescriptive definition for NoSQL, but we can still observe they have following commnoalities, such as:

* Not using SQL
* Support running on clusters
* Mostly open-source
* Built for the 21st century web estates
* Schema-less

<a id="0.4"><h3>0.4 Mainstream Nosql Databases</h3></a>
 
There are total of four different types of nosql data models. In this tutorial, we will be introducing one mature database from each category. The selection of a particular database from each cateogry is inspired by the course 95-737 NoSQL Database Management. 


#### * Key-value: 

* Data Model : the fundamental data model of a key-value store is a collection of key-value pairs where the key is unique in the collection. A key can be an ID or a name or anything you want to use as an identifier. The value can be anything from strings, lists, hashes, to sets. This type of database typically do not know or care what is stored in the value and the contents are often represented as blobs giving you no visibility into its actual contents. This makes the functionality of these databases fairly limited, especially with queries. 
* Sample Database: Redis
* strength: rich data structures, highly performant
* weakness: don't support nested key-value pairs, costly to store huge datasets
* common usage: caching users, timeline and tweets (Twitter), storing graph of who’s following whom (Pinterest)


#### * Document:
* Data Model: documents are the main concept in document databases. The database stores and retrieves documents, which can be XML, JSON, BSON, and so on. These documents are self-describing, hierarchical tree data structures which can consist of maps, collections, and scalar values. Document databases can store documents in the value part of the key-value store; think about document databases as key-value stores where the value is examinable.
* Sample Database: MongoDB
* Strength: very flexible to query by value
* Weakness: express highly complex relationship, faceted search (depends on model design), and transactions over multiple documents
* Common usage: producet catalogue, inventory management 

#### * Column Family: 

* Data Model: in column-family databases, each row consists of a collection of columns. A collection of similar rows then makes up a column family. In a relational databases, this would be equivalent to a collection of rows making up a table. The main difference is that in a column-family database, rows do not have to contain the same columns. De-normalizeid is usually a better solution and the design is centered around query patterns.
* Sample Database: Cassandra
* Strength: processing enormous amounts of mostly non-relational data, highly fault tolerant
* Weakness: Only allow to query by primary key
* Common usage: a web log analytics data warehouse and sensor data.

#### * Graph: 
* Data Model: in graph databases, a graph is a collection of nodes (vertices) and relationships (edges). This type of database is designed to capture complex, dynamic relationships in highly connected data.
* Sample Database: Neo4j
* Strength: querying deeply connected data that has many relationships expressed with complex joins
* Weakness: does not support sharding in which each node contains only a portion of the total data
* Common usage: social network


<a id="1"><h2>1. Redis</h2></a>
<a id="1.1"><h3>1.1 Let's connect to a Redis database on the cloud</h3></a><br>

In [None]:
import sys
!{sys.executable} -m pip install redis
!{sys.executable} -m pip install rejson

Steps to Follow:
1. The fastest way to create a Redis database is to deploy on cloud. Register an account on https://app.redislabs.com/#/sign-up?. 
2. After you have the database start and running, you can begin to think about how python can interact with this remote database. In fact, it's prettry easy. All information you need are the endpoint information and Redis password. The endpoint is a URL that consists of host and port in such format **host:post**, which you will be asked to fill in these numbers to the python code connecting to the cloud database.
<img src="https://image.ibb.co/j6NcUn/redis.png" alt="Drawing" style="width: 800px;"/>


- information to know: There is currently no free Redis GUI. 

<a id="1.2"><h3>1.2 Let's write Redis client </h3></a><br>

In [353]:
import redis

try:
    conn = redis.StrictRedis(
        host='Endpoint',
        port=14872,
        password='Redis Password, 
        charset="utf-8",
        # transform a byte string to utf-8 string
        decode_responses=True)
    
    print (conn)
    conn.ping()
    print ('Connected!')
except Exception as ex:
    print ('Error:', ex)
    exit('Failed to connect, terminating.')

StrictRedis<ConnectionPool<Connection<host=redis-14872.c13.us-east-1-3.ec2.cloud.redislabs.com,port=14872,db=0>>>
Connected!


<a id="1.3"><h3>1.3 Dive into Real Dataset </h3></a><br>

In [356]:
# we first use the partial user data from HW1
users_small = pd.read_csv("users_small.csv", na_filter=False)
print(len(users_small))
users_small.head()

6


Unnamed: 0,screen_name,name,location,created_at,friends_count,followers_count,statuses_count,favourites_count
0,realDonaldTrump,Donald J. Trump,"New York, NY",Wed Mar 18 13:46:38 +0000 2009,42,11397769,33136,38
1,Trump,Trump Organization,"New York, NY",Wed Apr 13 16:51:54 +0000 2016,35,9954,43,125
2,TrumpGolf,Trump Golf,,Mon Feb 03 13:46:03 +0000 2014,200,8797,758,251
3,TiffanyATrump,Tiffany Trump,,Tue Feb 01 20:59:30 +0000 2011,79,63138,573,28
4,IngrahamAngle,Laura Ingraham,DC,Thu Jun 25 21:03:25 +0000 2009,289,851876,26523,71


In [357]:
# preprocess the dataframe to index on screen_name, which can later be used as the key of the dictionary
users_samll_name=users_small.screen_name
users_small=users_small.set_index("screen_name")
users_small

Unnamed: 0_level_0,name,location,created_at,friends_count,followers_count,statuses_count,favourites_count
screen_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
realDonaldTrump,Donald J. Trump,"New York, NY",Wed Mar 18 13:46:38 +0000 2009,42,11397769,33136,38
Trump,Trump Organization,"New York, NY",Wed Apr 13 16:51:54 +0000 2016,35,9954,43,125
TrumpGolf,Trump Golf,,Mon Feb 03 13:46:03 +0000 2014,200,8797,758,251
TiffanyATrump,Tiffany Trump,,Tue Feb 01 20:59:30 +0000 2011,79,63138,573,28
IngrahamAngle,Laura Ingraham,DC,Thu Jun 25 21:03:25 +0000 2009,289,851876,26523,71
mike_pence,Mike Pence,,Fri Feb 27 23:04:51 +0000 2009,1208,207552,4572,746


In [358]:
# transform the dataframe to dictionary
users_dict_small=users_small.to_dict(orient='index')
users_dict_small

{'IngrahamAngle': {'created_at': 'Thu Jun 25 21:03:25 +0000 2009',
  'favourites_count': 71,
  'followers_count': 851876,
  'friends_count': 289,
  'location': 'DC',
  'name': 'Laura Ingraham',
  'statuses_count': 26523},
 'TiffanyATrump': {'created_at': 'Tue Feb 01 20:59:30 +0000 2011',
  'favourites_count': 28,
  'followers_count': 63138,
  'friends_count': 79,
  'location': '',
  'name': 'Tiffany Trump',
  'statuses_count': 573},
 'Trump': {'created_at': 'Wed Apr 13 16:51:54 +0000 2016',
  'favourites_count': 125,
  'followers_count': 9954,
  'friends_count': 35,
  'location': 'New York, NY',
  'name': 'Trump Organization',
  'statuses_count': 43},
 'TrumpGolf': {'created_at': 'Mon Feb 03 13:46:03 +0000 2014',
  'favourites_count': 251,
  'followers_count': 8797,
  'friends_count': 200,
  'location': '',
  'name': 'Trump Golf',
  'statuses_count': 758},
 'mike_pence': {'created_at': 'Fri Feb 27 23:04:51 +0000 2009',
  'favourites_count': 746,
  'followers_count': 207552,
  'friends_

<a id="1.4"><h3>1.4 Learn to Use Redis Hash</h3></a><br>
Redis Hashes are maps between string fields and string values, so they are the perfect data type to represent objects The following codes are demonstraing how to use Redis MSET command to set multiple values to multiple keys.

In [361]:
#pass a Python dictionary to mset command
conn.mset(users_dict_small)
#We can access the value by key, here we are accessing the user info about realDonaldTrump
print(conn.get('realDonaldTrump'))
print(type(conn.get('realDonaldTrump')))

{'name': 'Donald J. Trump', 'location': 'New York, NY', 'created_at': 'Wed Mar 18 13:46:38 +0000 2009', 'friends_count': 42, 'followers_count': 11397769, 'statuses_count': 33136, 'favourites_count': 38}
<class 'str'>


<a id="1.5"><h3>1.5 Possible Solutions to Store Nested Data in Redis</h3></a><br>

We can see that from the return type of **conn.get('realDonaldTrump')** is a **string**, which means there are no further keys for us to further drill down to easily find out say the location of "realDonaldTrump" unless we load the string back to json again.

As I mentioned earlier, Redis data structures are **not nestable**, so if you want to read or write the entire dictionary, there are several ways you can try:
1. The simplest is to serialize it (using JSON ) and store it as a plain String in Redis. (What I just demonstrated in 1.4!)
2. The second way is a relatively new. (There aren't many tutorials out there yet!) It's called the ReJSON module that will let you store JSON objects in redis and manipulate them directly. Reference: http://rejson.io/


<a id="2"><h2>2. MongoDB</h2></a><br>
<a id="2.1"><h3>2.1 Let's connect a MongoDB database on the Cloud</h3></a><br>

First, we need to understand the correponding terms for MongoDB storage. MongoDB stores data records in collections. Therefore, collections are actually analogous to tables in relational databases. In MongoDB, data records are stored as BSON documents, which is a binary representation of JSON documents. Within each MongoDB document are field-and-value pairs.
<img src="https://image.ibb.co/kbN5FS/mongovsql.png" alt="Drawing" style="width: 800px;"/>

Here I recommend you to use mlab, which is a fully managed cloud database service featuring automated provisioning and scaling of MongoDB databases. So you can you simply connect to a remote server to interact with the database.<br>

Steps to Follow:

1. Sign up on https://mlab.com/
2. Create a database and a collection. In my case, you can see I've named my Database **project4**. As we now know, a collection is a table, so from the picture, you can also know that I have named one collection **results** and this collection consists of 3 documents.
3. Remember at the **Users** section, you have to create one username and password as an administrator.

<img src="https://image.ibb.co/h0jwN7/mlab.png" alt="Drawing" style="width: 800px;"/>

<a id="2.2"><h3>2.2 Let's Write a MongoDB Client</h3></a>

In [None]:
import sys
!{sys.executable} -m pip install pymongo

In [366]:
# libaray reference: http://api.mongodb.com/python/current/tutorial.html
# PyMongo is a Python distribution containing tools for working with MongoDB, and is the recommended way to work with MongoDB from Python.
from pymongo import MongoClient
# the parameter is a standard MongDB URI directly retrieved from the picture above.
client = MongoClient("mongodb://username:password!@ds117759.mlab.com:17759/project4") 

<a id="2.3"><h3>2.3 Practice a Simply Query on Data that Already Exists in the Database</h3></a><br>

In [362]:
# specify the database name as project4
db = client['project4']
# specify the collection name as result
collection = db['results']
cursor = collection.find()
# display all the documets within the result collection. You can see there are exactly 3 documents in total.
for document in cursor:
    print(document)

{'_id': ObjectId('5ab3d927548865ae5f3a9c44'), 'Key1': 'Value1'}
{'_id': ObjectId('5ab3d92b548865ae5f3a9c45'), 'Key2': 'Value2'}
{'_id': ObjectId('5ab3d92e548865ae5f3a9c46'), 'Key3': 'Value3'}


<a id="2.4"><h3>2.4 Dive into Real Dataset</h3></a><br>

In [363]:
# we first use the partial user data from HW1
users = pd.read_csv("users3.csv", na_filter=False)
# total of 298 rows
print(len(users)) 
users.head()

298


Unnamed: 0,name,screen_name,location,created_at,friends_count,followers_count,statuses_count,favourites_count
0,Donald J. Trump,realDonaldTrump,"New York, NY",Wed Mar 18 13:46:38 +0000 2009,42,11397769,33136,38
1,Trump Organization,Trump,"New York, NY",Wed Apr 13 16:51:54 +0000 2016,35,9954,43,125
2,Trump Golf,TrumpGolf,,Mon Feb 03 13:46:03 +0000 2014,200,8797,758,251
3,Tiffany Trump,TiffanyATrump,,Tue Feb 01 20:59:30 +0000 2011,79,63138,573,28
4,Laura Ingraham,IngrahamAngle,DC,Thu Jun 25 21:03:25 +0000 2009,289,851876,26523,71


<a id="2.5"><h3>2.5 Create Table and Insert Dataset</h3></a><br>

In [367]:
# create a new collection called accounts to store all the user data
accounts = db.accounts
 
for idx, user in users.iterrows():
    user = {"name": user.loc['name'],"screen_name": user.loc['screen_name'],"location": user.loc['location'],
            "date": user.loc['created_at'],"friends_count":user.loc['friends_count'],
            "followers_count":user.loc['followers_count'],"statuses_count":user.loc['statuses_count'],
            "favourites_count":user.loc['favourites_count']}
    #To insert a document into a collection we can use the insert_one() method:
    accounts.insert_one(user).inserted_id
        

<a id="2.6"><h3>2.6 Practice a Range Query on More than One Document</h3></a><br>

In [368]:
#find all the users with follower_count greater than 30000000
import pprint
for user in accounts.find({"followers_count": {"$gt": 30000000}}).sort("followers_count",-1):
    pprint.pprint(user)

{'_id': ObjectId('5aba6642de9eac13c14d0fac'),
 'date': 'Thu Aug 14 03:50:42 +0000 2008',
 'favourites_count': 321,
 'followers_count': 61931395,
 'friends_count': 36339,
 'location': 'California',
 'name': 'Ellen DeGeneres',
 'screen_name': 'TheEllenShow',
 'statuses_count': 12711}
{'_id': ObjectId('5aba6643de9eac13c14d0fd2'),
 'date': 'Fri Mar 18 18:36:02 +0000 2011',
 'favourites_count': 81,
 'followers_count': 30624508,
 'friends_count': 369,
 'location': '@happyhippiefdn',
 'name': 'Miley Ray Cyrus',
 'screen_name': 'MileyCyrus',
 'statuses_count': 8131}
{'_id': ObjectId('5aba6645de9eac13c14d100c'),
 'date': 'Wed Jun 24 18:44:10 +0000 2009',
 'favourites_count': 6,
 'followers_count': 30520703,
 'friends_count': 169,
 'location': 'Seattle, WA',
 'name': 'Bill Gates',
 'screen_name': 'BillGates',
 'statuses_count': 2131}


<a id="3"><h2>3. Cassandra</h2></a><br>
<a id="3.1"><h3>3.1  Let's Connect to a Standalone Cassandra in Local Machine</h3></a><br>

Just like MongoDB, Cassandra also has its own term to define the way they store the data. Below is the correponding terms for Cassandra database. 
<img src="https://image.ibb.co/eftCvS/cassandra.png" alt="Drawing" style="width: 800px;"/>

### Steps to folow:
1. Download the latest Apache Cassandra at http://cassandra.apache.org/download/
2. Unzip Cassandra using the command zxvf as shown below. <br> 
   $tar zxvf apache-cassandra-2.1.2-bin.tar.gz. 
3. cd to Downloads/apache-cassandra-3.11.2/bin
4. command to start Cassandra sudo ./cassandra -R 
5. command to start the CQL shell ./cqlsh

<a id="3.2"><h3>3.2 Let's write a Cassndra Client</h3></a><br>

In [None]:
# this take around 3-5 minutes to complete. Don't kill it in the midway. 
import sys
!{sys.executable} -m pip install cassandra-driver
!{sys.executable} -m pip install uuid
!{sys.executable} -m pip install time_uuid

In [372]:
# Reference: https://datastax.github.io/python-driver/getting_started.html#connecting-to-cassandra
from cassandra.cluster import Cluster
cluster = Cluster()
# connect to a default key space
session = cluster.connect('twitterapp')
# You can always change a Session’s keyspace using set_keyspace()
session.set_keyspace('twitterapp')

<a id="3.3"><h3>3.3 Dive into Real Dataset </h3></a><br>

In [371]:
import pandas as pd
tweets = pd.read_csv("tweets_small.csv", na_filter=False)
print(len(tweets))
tweets.head()

4


Unnamed: 0,screen_name,created_at,retweet_count,favorite_count,text
0,realDonaldTrump,Fri Sep 09 02:00:32 +0000 2016,2859,7030,Final poll results from NBC on last nights Com...
1,mgleslie6,Fri Apr 22 00:33:15 +0000 2016,13,0,"RT @wendellray: ""The Crime."" Now Playing on a ..."
2,Trump,Fri Sep 16 21:35:08 +0000 2016,21,61,Did you know that @TrumpGolfDC Is located just...
3,realDonaldTrump,Thu Sep 08 18:16:25 +0000 2016,3710,12860,Mexico has lost a brilliant finance minister a...


<a id="3.4"><h3>3.4 Create Table, Insert Data and Retrieve Data </h3></a><br>

In [393]:
# Example 1: Create tweets table and insert tweets into the tweets table
# session.execute("DROP TABLE tweets")
session.execute("CREATE TABLE tweets(tweet_id uuid PRIMARY KEY,username text,body text)")

<cassandra.cluster.ResultSet at 0x1080d7b00>

Tweets are stored in a simple table where the primary key is a UUID column, ensuring the tweet’s uniqueness. We
don’t track when the tweet was added in this table as that’s handled by the user’s timeline (see the userline and
timeline table creation below).

**UUID** : A UUID (Universal Unique Identifier) is a 128-bit number used to uniquely identify some object or entity on the Internet. 

In [394]:
import uuid
# passing Parameters to CQL Queries


for i in range(len(tweets["screen_name"])):
    screen_name = tweets['screen_name'].iloc[i]
    tweet = tweets['text'].iloc[i]
    
    session.execute(
    """
    INSERT INTO tweets (tweet_id, username, body)
    VALUES (%s, %s, %s)
    """,
    (uuid.uuid1(), screen_name, tweet))

# The above query is identical and can be translated to the following CQL query:
# INSERT INTO tweets (tweet_id, username, body) VALUES (uuid(), tweets['screen_name'].iloc[i], tweets['text'].iloc[i]);


In [395]:
import numpy as np
rows = session.execute('SELECT tweet_id, username, body FROM tweets')

tweets_from_db = []
for user_row in rows:
    tweets_from_db.append([user_row.tweet_id,user_row.username,user_row.body])
tweets_all=pd.DataFrame(np.asarray(tweets_from_db),columns=['tweet_id','username','body'])
tweets_all


Unnamed: 0,tweet_id,username,body
0,84f5d094-31d7-11e8-a0b7-784f439848e9,realDonaldTrump,Mexico has lost a brilliant finance minister a...
1,84f578cc-31d7-11e8-ac78-784f439848e9,Trump,Did you know that @TrumpGolfDC Is located just...
2,84f30350-31d7-11e8-8b67-784f439848e9,realDonaldTrump,Final poll results from NBC on last nights Com...
3,84f4e2e2-31d7-11e8-8d08-784f439848e9,mgleslie6,"RT @wendellray: ""The Crime."" Now Playing on a ..."


In [407]:
# Example 2: Create userline table and insert users into the userline table
# session.execute("DROP TABLE userline")
session.execute("CREATE TABLE userline (username text, time float, tweet_id uuid, PRIMARY KEY (username, tweet_id))")

<cassandra.cluster.ResultSet at 0x108466b00>

In [409]:
import time_uuid

# Insert user and its corresponding tweet into the userline table
for i in range(len(tweets["screen_name"])):
    screen_name = tweets["screen_name"].iloc[i] 
    
    session.execute(
    """
    INSERT INTO userline (username, time, tweet_id)
    VALUES (%s, %s, %s)
    """,
    (screen_name, time_uuid.utctime(), tweets_all['tweet_id'][i]))


In [411]:
userlines = session.execute('SELECT username, time, tweet_id FROM userline')

userlines_from_db = []
for user_row in userlines:
    userlines_from_db.append([user_row.username,user_row.time,user_row.tweet_id])
userlines_all=pd.DataFrame(np.asarray(userlines_from_db),columns=['username', 'time', 'tweet_id'])
userlines_all

Unnamed: 0,username,time,tweet_id
0,mgleslie6,1522170000.0,84f578cc-31d7-11e8-ac78-784f439848e9
1,Trump,1522170000.0,84f30350-31d7-11e8-8b67-784f439848e9
2,realDonaldTrump,1522170000.0,84f4e2e2-31d7-11e8-8d08-784f439848e9
3,realDonaldTrump,1522170000.0,84f5d094-31d7-11e8-a0b7-784f439848e9


<a id="3.5"><h3>3.5 Summary</h3></a>

Curious why username appear both in two tables? Here I only give the example of Userline. But in fact Timeline and Userline tables can both be the result of denormalization. Timeline keeps track of people who are going to receive a user’s tweets, the time when the post was created and the tweetID. Userline keeps track of people who posted a tweet, the time of tweet posted and tweetID. By having denormalized tables stored in the database, a lot of queries involved join operations can be saved to enhance overall read performance. For example, in the case of a timeline, we don’t have to join tweets with the followers table to find out where a tweet should display, more specifically, who is going to see the tweet. But in fact, the table can be even more demoralized. By that I mean to directly copy the body of the tweet into both userline and timeline table. This will allow us to get everything we need in one table. 


<a id="4"><h2>4 Neo4j</h2></a>
<a id="4.1"><h3>4.1 Let's Connect to a Neo4j Database in a Local Machine</h3></a>

### Steps to folow:
1. You can download a neo4j Desktop, which can be run as a console application or as a service from https://neo4j.com/download/
2. In my case, I have create a **database** called **db** with its **graph** name **data**
3. Start the server
<img src="https://image.ibb.co/kst6N7/neo4j_Desktop.png" alt="Drawing" style="width: 800px;"/>

<a id="4.2"><h3>4.2 Let's Write Neo4j Client</h3></a>

In [None]:
import sys
# liabray reference: http://py2neo.org/v3/
!{sys.executable} -m pip install py2neo
# liabray reference: https://github.com/merqurio/neo4jupyter
!{sys.executable} -m pip install neo4jupyter
# liabray reference: http://ipython-cypher.readthedocs.io/en/latest/introduction.html
#!{sys.executable} -m pip install ipython-cypher

In [461]:
from py2neo import Graph
# The Graph class represents a Neo4j graph database. We connect to the database "db" and the graph "data"
graph = Graph("http://localhost:7474/db/Graph/", password="0321")

<a id="4.3"><h3>4.3 Dive into Real Data</h3></a>
I have used part of the twitter data given in HW1. In this tutorial, I have a dataset called **edges3.csv** , which consist of 77 relationships. I also have another dataset **users3.csv**, which consist of 298 individual user's account information on twitter. I will be using data from users3.csv to generate the nodes and edges3.csv to model the relationships. You might be wondering the purpose of restoring the data this way. Don't worry! Later when we query the data, you will immediately find out why.

In [413]:
# we first use the partial edges data from HW1
import pandas as pd
edges = pd.read_csv("edges3.csv", na_filter=False)
print(len(edges))
edges.head()

77


Unnamed: 0,screen_name,friend
0,realDonaldTrump,Trump
1,realDonaldTrump,TrumpGolf
2,realDonaldTrump,TiffanyATrump
3,realDonaldTrump,IngrahamAngle
4,realDonaldTrump,mike_pence


In [412]:
# we first use the partial users data from HW1
users = pd.read_csv("users3.csv", na_filter=False)
print(len(users))
users.head()

298


Unnamed: 0,name,screen_name,location,created_at,friends_count,followers_count,statuses_count,favourites_count
0,Donald J. Trump,realDonaldTrump,"New York, NY",Wed Mar 18 13:46:38 +0000 2009,42,11397769,33136,38
1,Trump Organization,Trump,"New York, NY",Wed Apr 13 16:51:54 +0000 2016,35,9954,43,125
2,Trump Golf,TrumpGolf,,Mon Feb 03 13:46:03 +0000 2014,200,8797,758,251
3,Tiffany Trump,TiffanyATrump,,Tue Feb 01 20:59:30 +0000 2011,79,63138,573,28
4,Laura Ingraham,IngrahamAngle,DC,Thu Jun 25 21:03:25 +0000 2009,289,851876,26523,71


<a id="4.4"><h3>4.4 Let's Create Nodes and Relationships</h3></a>

In [437]:
from py2neo import Node
from py2neo import Relationship

# create a dictionary of key:value=screename:node of the person
nodeDicts={}
for i in range(len(users["name"])):
    name=users["name"].iloc[i]
    screen_name=users["screen_name"].iloc[i]
    nodeDicts[users["screen_name"].iloc[i]] = Node("Person", name=name,screen_name=screen_name)

options = {"Person":"screen_name"}

# put all the nodes into the graph
for node in nodeDicts.values():
    graph.create(node)
# create relationships 
for i in range(len(edges["screen_name"])):
    graph.create(Relationship(nodeDicts[edges["screen_name"].iloc[i]], "LIKES",  nodeDicts[edges["friend"].iloc[i]]))

<a id="4.5"><h3>4.5 Let's Visualize on Graph</h3></a>

In [None]:
import neo4jupyter
# first thing you must do is call the neo4jupyter.init_notebook_mode() to load all the javascript.
neo4jupyter.init_notebook_mode()
# drawing a graph it's as easy as giving the funcion draw
neo4jupyter.draw(graph,options)
# graph is displayed using javascript but the result won't be saved on jupyter notebook. So I took a screenshot.


<img src="https://image.ibb.co/cHQ327/Screenshot_2018_03_27_12_44_16.png" alt="Drawing" style="width: 800px;"/>

### * What can we observe from the graph?<br>
At our first sight, we can spot two obvious clusters and many individual nodes. When your hover the mouse over a certain node, we will see the name and the screen_name of the node appear. This is because nodes can contain key-value pairs known as properties in neo4j. The left center node is **Trump** and the right center node is **realDonaldTrump**. This is a directed graph. The outer nodes that **Trump** and **RealDonaldTrump** are pointing to are the people they have followed. For further improvement on the graph, we can consider to add weights to the relationships or nodes when relavant data is available. Also, when you look closely, there are actually some nodes pointed (followed) both by both **Trump** and **RealDonaldTrump**. 

<a id="4.6"><h3>4.6 Let's learn to do some simple queries with cypher</h3></a><br>
The query below tries to find the people **Trump** and **realDonaldTrump** both follow

Example:
* MATCH (p1:Person)-[:LIKES]->(p2:Person)
* meaning: p1 likes p2 (with direction)



In [460]:
query = """
MATCH (p1:Person)-[:LIKES]->(p2:Person)<-[:LIKES]-(p3:Person)
WHERE p1.screen_name = {screen_name1} AND p3.screen_name = {screen_name2}
RETURN p2.screen_name
"""

data = graph.run(query,screen_name1="realDonaldTrump",screen_name2="Trump")
count = 0
for d in data:
    count=count+1
    print(d)
    
print("Number of people Trump and realDonaldTrump both follow: "+str(count))


('p2.screen_name': 'IvankaTrump')
('p2.screen_name': 'EricTrump')
('p2.screen_name': 'DonaldJTrumpJr')
('p2.screen_name': 'TrumpCharlotte')
('p2.screen_name': 'TrumpDoral')
('p2.screen_name': 'TrumpGolfLA')
('p2.screen_name': 'TrumpGolfDC')
('p2.screen_name': 'TrumpLasVegas')
('p2.screen_name': 'TrumpChicago')
('p2.screen_name': 'TrumpWaikiki')
('p2.screen_name': 'TrumpGolf')
('p2.screen_name': 'TiffanyATrump')
Number of people Trump and realDonaldTrump both follow: 12


In [436]:
#graph.delete_all() 

<a id="5"><h2>5. Final Takeaway</h2></a><br>
The last graph provides you a basic idea on how to choose an appropriate database in different scenarios. However, this doesn't mean you can only choose one from them. In fact, just like the marketing slogan for "Two Great Tastes That Taste Great Together" for Reese's Peanut Butter Cups, you may consider using two or more databases to store different data based on your needs. There is actually a fancy term for this too --- ** Polyglot persistence**.
<img src="https://image.ibb.co/mZjD27/tree.png" alt="Drawing" style="width: 800px;"/>


<a id="6"><h2>6. References</h2></a>
* MongoDB python API: http://api.mongodb.com/python/current/tutorial.html
* py2neo API: http://py2neo.org/v3/types.html
* Neo4j API: https://github.com/merqurio/neo4jupyter
* Cassandra API: http://datastax.github.io/python-driver/getting_started.html
* Redis API: https://pypi.python.org/pypi/redis
* More on redis usage in python:https://github.com/jadianes/redis-integration-patterns-python/blob/master/notebooks/Introduction%20to%20Redis%20with%20Python.ipynb
* Slides from 95-737 NoSQL Database Management