# Data Science w/Python

# HW5

<b>Penalties:</b> You will incur penalties if:
<ul>
<li>Your code is wrong</li>
<li>Your code would not work on different data</li>
<li>Your code is unnecessarily slow (you use a for loop, use DataFrame.apply unnecessarily, etc.)</li>
<li>Your answer is composed of more than one output, unless explicitly permitted</li>
</ul>

In [1]:
import warnings

import pandas as pd
from pyspark.sql import SparkSession
# Import only the functions used below; a wildcard import would shadow Python built-ins such as sum and pow
from pyspark.sql.functions import col, instr, length, lower, when

# Suppress library warnings to keep the cell output readable
warnings.filterwarnings('ignore')

# Create (or reuse) a Spark session with modest memory settings
spark = SparkSession \
    .builder \
    .appName("Foo") \
    .config("spark.executor.memory", "1g") \
    .config("spark.driver.memory", "1g") \
    .getOrCreate()

In [2]:
# print all the outputs in a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Data Set Description

## USA.gov Data from Bitly

In 2011, the URL shortening service Bitly (*Bitly.com*) partnered with the US government website *USA.gov* to provide a feed of anonymous data gathered from users who shorten links ending with *.gov* or *.mil*. The service was shut down in 2017.

In this dataset, each line contains a common form of web data known as **JSON**, which stands for *JavaScript Object Notation*. Python has both built-in and third-party libraries for converting a JSON string into a Python dictionary object. We can then use *pd.DataFrame* to convert the dictionary objects into a DataFrame for our analysis.

Each line has a number of identifying attributes.

In [18]: records[0] <br>
Out[18]: <br>
{'a': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko)
Chrome/17.0.963.78 Safari/535.11', <br>
 'al': 'en-US,en;q=0.8', <br>
 'c': 'US',<br>
 'cy': 'Danvers',<br>
 'g': 'A6qOVH',<br>
 'gr': 'MA',<br>
 'h': 'wfLQtf',<br>
 'hc': 1331822918,<br>
 'hh': '1.usa.gov',<br>
 'l': 'orofrog',<br>
 'll': [42.576698, -70.954903],<br>
 'nk': 1,<br>
 'r': 'http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf',<br>
 't': 1331923247,<br>
 'tz': 'America/New_York',<br>
 'u': 'http://www.ncbi.nlm.nih.gov/pubmed/22415991'}<br>
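
As a minimal sketch of the JSON-to-dictionary step described above, the standard-library `json` module can parse each line into a Python dict. The two sample lines below are abbreviated, made-up stand-ins for lines of the data file (only the field names `c`, `cy`, and `tz` come from the record shown); the resulting list of dicts is the kind of object that could then be handed to *pd.DataFrame*.

```python
import json

# Hypothetical, abbreviated sample lines; the real file holds one JSON object per line.
sample_lines = [
    '{"c": "US", "cy": "Danvers", "tz": "America/New_York"}',
    '{"c": "RU", "cy": "Moscow", "tz": ""}',
]

# Parse every non-empty line into a dictionary.
records = [json.loads(line) for line in sample_lines if line.strip()]

print(records[0]["cy"])
print(len(records))
```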
 

## Convert this dictionary *records* into a DataFrame. (Find out which method to use yourself.)

In [3]:
records=spark.sparkContext.textFile("C:/Users/ramya/Desktop/Santa Clara University/Q3/Pyspark/usagov.txt")

df = spark.read.json(records)

### Q1.1,   How many records are in the Dataframe? (In Camino, pick the right number)

In [4]:
df.count()

3560

### Q1.2,   In the city column ('cy'), how many records are NaN? (In Camino, pick the right number)

In [5]:
df.filter(col("cy").isNull()).count()

641

In [6]:
df.filter(col("cy").isNotNull()).count()

2919

### Q1.3,   Excluding NaN, how many different countries does this dataset include?  (In Camino, pick the right number)

In [7]:
df.select("c").filter(col("c").isNotNull()).distinct().count()

71

### Q1.4, Outside of the US, what are the top 5 cities that use this Bitly service? (In Camino, choose the city name with the most usage)

In [8]:
df.filter(col("c")!="US").groupby("cy").count().sort("count",ascending=False ).select("cy").show(5)

+------+
|    cy|
+------+
|Nogata|
|London|
|Madrid|
|Mexico|
|SPaulo|
+------+
only showing top 5 rows



## Q2

### Q2.1, How many records are from Russia?  (In Camino, pick the right number)

In [9]:
df.filter(col("c")=="RU").count()

13

### Q2.2, In those records, which city has the highest usage count?  (In Camino, pick the right city name)

In [10]:
df.filter(col("c")=="RU").groupby("cy").count().sort("count",ascending=False).show(1)

+------+-----+
|    cy|count|
+------+-----+
|Moscow|    8|
+------+-----+
only showing top 1 row



### Q2.3, In those records, how many accessed cia.gov? (In Camino, pick the right number)

In [11]:
df.select("u").filter(col("c")=="RU").where(instr(df.u,"cia.gov")>=1).show(truncate=False)

+------------------------------------------------------------------------+
|u                                                                       |
+------------------------------------------------------------------------+
|https://www.cia.gov/library/publications/world-leaders-1/index.html     |
|https://www.cia.gov/library/publications/the-world-factbook/geos/ke.html|
+------------------------------------------------------------------------+



In [12]:
df.select("u").filter(col("c")=="RU").where(instr(df.u,"cia.gov")>=1).count()

2

## Q3 We are interested in time zones in this data set (the **tz** field). 

### Q3.1, Let's clean the tz field. If it is empty, fill it with 'Unknown'. If it is NaN, fill it with 'Missing'.  After cleaning, how many tz fields are in the 'Missing' state and how many are in the 'Unknown' state?  (In Camino, fill in these two numbers in the format 'Missing'/'Unknown'. Don't leave any spaces in the answer.)

In [13]:
from pyspark.sql.functions import regexp_replace,when

In [14]:
df=df.withColumn("tz",when(length(col("tz"))==0,"Unknown").otherwise(df.tz))

In [15]:
df=df.withColumn("tz",when(df.tz.isNull(),"Missing").otherwise(df.tz))
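
The difference between the two states can be sketched in plain Python, assuming an empty string means the field was present but blank and `None` means it was absent. The helper name `clean_tz` and the sample values are hypothetical and only illustrate the rule; they are not part of the PySpark solution.

```python
from typing import Optional


def clean_tz(tz: Optional[str]) -> str:
    """Map a raw tz value to its cleaned form."""
    if tz is None:   # field absent from the record
        return "Missing"
    if tz == "":     # field present but empty
        return "Unknown"
    return tz


# Made-up sample values covering all three cases.
raw = ["America/New_York", "", None, "Europe/London", ""]
cleaned = [clean_tz(v) for v in raw]

missing = cleaned.count("Missing")
unknown = cleaned.count("Unknown")
print(f"{missing}/{unknown}")
```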

In [16]:
df.filter(col("tz")=="Missing").count()

120

In [17]:
df.filter(col("tz")=="Unknown").count()

521

### Q3.2, What are the top 10 time zones in this data set? (Exclude Unknown and Missing.) (In Camino, pick the 10th-place time zone)

In [18]:
# Both conditions must hold, so combine them with & (using | would keep every row)
df.filter((col("tz")!="Missing") & (col("tz")!="Unknown")).groupby("tz").\
count().sort(col("count"),ascending=False).show(10)

+-------------------+-----+
|                 tz|count|
+-------------------+-----+
|   America/New_York| 1251|
|    America/Chicago|  400|
|America/Los_Angeles|  382|
|     America/Denver|  191|
|      Europe/London|   74|
|         Asia/Tokyo|   37|
|   Pacific/Honolulu|   36|
|      Europe/Madrid|   35|
|  America/Sao_Paulo|   33|
|      Europe/Berlin|   28|
+-------------------+-----+
only showing top 10 rows
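
When excluding two values, both inequality tests must hold at once, so they are combined with AND. Combined with OR, the condition is true for every value, because no string equals both 'Missing' and 'Unknown'. A plain-Python sketch with made-up values shows the difference:

```python
# Made-up sample values for illustration only.
values = ["America/New_York", "Unknown", "Missing", "Europe/London"]

# OR: every value differs from at least one of the two strings, so nothing is removed.
kept_or = [v for v in values if v != "Missing" or v != "Unknown"]

# AND: a value is kept only if it differs from both strings.
kept_and = [v for v in values if v != "Missing" and v != "Unknown"]

print(kept_or)
print(kept_and)
```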



### Q3.4,  Based on the 'c' field (country) and on the 'America' keyword in the 'tz' field, count how many records are from the US. (In Camino, fill in those two counts in the format count_by_c/count_by_tz. Don't leave any spaces in the answer.)

In [19]:
df.filter(col("c")=="US").count()

2305

In [24]:
df.where(instr(lower(df.tz),"america")>=1).count()

2412

### Q3.5,  Based on the last question, do both counts match? If not, find out how many records have a time zone with the 'America' keyword but a country other than US. (In Camino, select the correct number)

In [21]:
df.filter(col("c")=="US").where(instr(df.tz,"America")>=1).count()

2269

In [22]:
df.filter(instr(df.tz,"America")>=1).filter(col("c")!="US").count()

143
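
The counts above can be cross-checked with simple arithmetic: records with an 'America' time zone split into those whose country is US and those whose country is not, assuming no such record has a null country. The variable names are illustrative; the numbers are the cell outputs above.

```python
count_by_c = 2305   # records with c == "US"
count_by_tz = 2412  # records whose tz contains "America"
both = 2269         # c == "US" and tz contains "America"
tz_only = 143       # tz contains "America" but c != "US"

# The two tz-based groups should add up to the total tz-based count.
assert both + tz_only == count_by_tz

# US records whose tz does not contain "America" (for example, Missing or Unknown).
c_only = count_by_c - both
print(c_only)
```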