# Accessing data from remote servers 

In the last lecture we saw how to access data from an SQL database that is stored locally. In this lecture, you will learn how to access an SQL database stored remotely on a server.

## Using Pandas With a Remote MySQL Server
### Installation
You need the `pymysql` package to connect to a MySQL server from Python.

Open a terminal window (not the one that is running your Jupyter notebook server) and type `pip install pymysql` or `conda install pymysql`.

### Connecting to an SQL server

You can connect to an SQL server only if its firewall lets you through.
For example, if you try to connect to the Emlyon server from outside the school, it won't work: port 3306, the default MySQL port, is normally blocked for outside connections.

To connect to an SQL server you need to specify the following:

- `server = "xx.xxx.x.xxx" ` : the IP address of the server or its hostname
- `username = "xxx" `
- `password = "xxx" `
- `connection = pymysql.connect(host=server, user=username,password=password,db='database_name',charset='utf8')`
                             
### Executing a query on a Database present in the server
- `pd.read_sql('SQL_Query', connection)`
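
The connection steps above can be sketched as a small helper that keeps the server address and credentials out of the notebook. This is only an illustration under assumptions: the environment variable names (`MYSQL_HOST`, `MYSQL_USER`, `MYSQL_PASSWORD`, `MYSQL_DB`) and the function names `mysql_settings` and `run_query` are hypothetical and not part of the course setup.

```python
import os


def mysql_settings(env=None):
    """Collect pymysql.connect() keyword arguments from environment variables.

    The MYSQL_* variable names are hypothetical; pick any names you like.
    """
    env = os.environ if env is None else env
    return {
        "host": env["MYSQL_HOST"],
        "user": env["MYSQL_USER"],
        "password": env["MYSQL_PASSWORD"],
        "db": env["MYSQL_DB"],
        "charset": "utf8",
    }


def run_query(sql, settings):
    """Open a connection, run one query with pandas, and always close the connection."""
    # Imported here so that mysql_settings() works even without these packages.
    import pandas as pd
    import pymysql

    connection = pymysql.connect(**settings)
    try:
        return pd.read_sql(sql, connection)
    finally:
        connection.close()
```

Once the variables are set, `run_query('SHOW tables', mysql_settings())` would return the list of tables as a dataframe, provided the firewall lets you reach the server.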

## Fetching data from Twitter 

There are several libraries that allow you to read data from Twitter, including Tweepy, Twython and TwitterAPI.

![TwitterAPI](img/TwitterAPIs.png)

In this lecture we will experiment with the Python Twitter API; the other libraries are similar and just as simple to use.

###  What is an API?
In computer programming, an application programming interface (API) is a set of subroutine definitions, protocols, and tools for building application software. In general terms, it is a set of clearly defined methods of communication between various software components.

### Setup
To use these libraries, you need to do the setup in the document SetupTwitterAPI.pdf.



### Python Twitter API  

#### Installation
`pip install twitter`

`conda install twitter `

#### Import

`import json` : the data returned by Twitter comes in a JSON-like structure

JSON (JavaScript Object Notation) is a lightweight format for storing and transporting data. It is often used when data is sent from a server to a web page.

`import twitter` 

`from twitter import Twitter, OAuth, TwitterHTTPError, TwitterStream `

#### Connection
`oauth = OAuth(ACCESS_TOKEN, ACCESS_SECRET, CONSUMER_KEY, CONSUMER_SECRET)`

#### Streaming Data (Watching Incoming Live Tweets)
- Initiate the connection to Twitter Streaming API

`twitter_stream = TwitterStream(auth=oauth)`

- Iterate through the streamed statuses

`iterator = twitter_stream.statuses.filter(track=" ", language=" ")`

`track` : a string to track, for example a hashtag

`language` : the language of the tweets, for example `"fr"` for French or `"en"` for English


- Save the data in a variable

`mydata = []
for tweet in iterator:
    mydata.append(tweet)`
    
        
Twitter Python Tool wraps the data returned by Twitter as a TwitterDictResponse object.
Check here what's in this data: https://dev.twitter.com/overview/api/tweets

Examples of accessing the data in this object:


`mydata[0].keys()` : get the keys of the dictionary

`mydata[0]['text']` : get the text of the tweet

`mydata[0]['entities']['hashtags']` : get the hashtags
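
As a small illustration of working with these dictionaries, the sketch below pulls the hashtag strings out of each tweet and counts them. The two sample tweets are made up and trimmed to the fields used here, and the helper name `hashtag_texts` is hypothetical. The `.get()` calls are a precaution for stream messages that carry no `entities` key.

```python
from collections import Counter

# Two made-up, trimmed-down tweets; only the fields used below are kept.
sample_tweets = [
    {"text": "Introducing Firebase #MachineLearning #BigData",
     "entities": {"hashtags": [{"text": "MachineLearning"}, {"text": "BigData"}]}},
    {"text": "How far can your personal data go? #BigData",
     "entities": {"hashtags": [{"text": "BigData"}]}},
]


def hashtag_texts(tweet):
    """Return the hashtag strings of one tweet (an empty list if there are none)."""
    return [tag["text"] for tag in tweet.get("entities", {}).get("hashtags", [])]


# Count hashtags over all tweets, ignoring upper/lower case.
counts = Counter(tag.lower() for tweet in sample_tweets for tag in hashtag_texts(tweet))
print(counts.most_common())
```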

- Search for a tweet and control the results

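`twitter = Twitter(auth=oauth)
tweets = twitter.search.tweets(q=q, result_type=result_type, lang=lang, count=count)`

  - `q` : the string to search for in the tweets
  - `result_type` : which results to return, for example `"recent"`
  - `lang` : the language of the tweets, for example `"en"`
  - `count` : the number of results to return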



This query returns a dictionary.  You need to pull out the tweets themselves from 'statuses'.

`tweets.keys() `

`for tweet in tweets['statuses']:
    print(tweet['text'])`



#### Making a DataFrame from JSON or Dict Results

This is straightforward: pass the list of dicts to `pd.DataFrame`. We have already seen JSON and dicts in class, and a quick web search turns up more ways to handle them.


`pd.DataFrame(mydata)`

## Reading from Public Data APIs

You may need to run `conda install pandas-datareader` if the package is not installed yet.


### Import
`import pandas_datareader.data as web`

### Supported internet sources
Functions from `pandas_datareader.data` and `pandas_datareader.wb` extract data from various Internet sources into a pandas DataFrame. Currently the following sources are supported:

- <a href="https://fr.finance.yahoo.com">Yahoo Finance</a>
- Tiingo
- IEX
- Alpha Vantage
- Enigma
- Quandl
- St. Louis Fed (FRED)
- Kenneth French’s data library
- World Bank
- OECD
- Eurostat
- Thrift Savings Plan
- Nasdaq Trader symbol definitions
- Stooq
- MOEX



### Read data
An example of reading data from a public data API:
`f = web.DataReader("dataset_name", 'api_name', start, end)`
  - `start`: start date and/or time
  - `end`: end date and/or time
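
A minimal sketch of how `start` and `end` might be built with the standard `datetime` module. The helper names `date_window` and `read_fred_series` are hypothetical, and the FRED call is wrapped in a function because it needs `pandas-datareader` and network access.

```python
import datetime


def date_window(days, end=None):
    """Return (start, end) datetimes covering the `days` days before `end`."""
    end = datetime.datetime.now() if end is None else end
    return end - datetime.timedelta(days=days), end


def read_fred_series(series_name, days=365):
    """Fetch one FRED series for the window (needs pandas-datareader and a network)."""
    import pandas_datareader.data as web

    start, end = date_window(days)
    return web.DataReader(series_name, "fred", start, end)


start, end = date_window(30, end=datetime.datetime(2013, 1, 27))
print(start, end)
```

For example, `read_fred_series("VIXCLS")` would request roughly the last year of the series.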
  
### Documentation and examples:
Work through the following tutorial to get an overview of the supported APIs and of the functions implemented for accessing public data APIs:

https://pandas-datareader.readthedocs.io/en/latest/remote_data.html




Data from Yahoo Finance

In [4]:
#Data from Yahoo Finance
import pandas_datareader.data as web
 
aapl = web.DataReader("AAPL", "yahoo")
    

Price and volume data from IEX


In [18]:
tops = web.DataReader(["GS", "AAPL"], "iex-tops")
   

Top of book executions from IEX

In [19]:

gs = web.DataReader("GS", "iex-last")
    

Real-time depth of book data from IEX


In [20]:
gs = web.DataReader("GS", "iex-book")
   

 Data from FRED


In [21]:
vix = web.DataReader("VIXCLS", "fred")
   

  Data from Fama/French


In [22]:
ff = web.DataReader("F-F_Research_Data_Factors", "famafrench")
ff = web.DataReader("F-F_Research_Data_Factors_weekly", "famafrench")
ff = web.DataReader("6_Portfolios_2x3", "famafrench")
ff = web.DataReader("F-F_ST_Reversal_Factor", "famafrench")

## Import data from a URL 

If data is stored on a server in a csv format, and you have the url, then you can use the read_csv function:
`pd.read_csv('https://url_path/data.csv')` 

### Loading JSON Data

JSON (JavaScript Object Notation) is a text format for serializing and exchanging data. It is easy to read and write. It is based on a subset of the JavaScript programming language, but its conventions are familiar from Python and many other languages.
JSON is often used to store semi-structured or nested data that does not fit neatly into the fixed tables of an SQL database, while remaining easy for machines to parse.
JSON is built on two structures:

- A collection of key/value pairs (an object). In Python this corresponds to a dictionary: keys are unique, whereas values need not be.
- An ordered list of values (an array). In Python this corresponds to a list; the values can be strings, numbers, or other lists and objects.

If data is stored on a server in simple json format:
`pd.read_json('https://url_path/data.json')` 
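
To make the two structures concrete, here is a short sketch using only the standard `json` module. The JSON text is a made-up example.

```python
import json

# Made-up JSON text: an object that contains an array, a boolean and a null.
text = '{"name": "Bitcoin", "ticker": "BTC", "tags": ["crypto", "coin"], "listed": true, "homepage": null}'

record = json.loads(text)              # JSON object -> Python dict

print(type(record).__name__)           # the object became a dict
print(type(record["tags"]).__name__)   # the array became a list
print(record["listed"], record["homepage"])

# json.dumps goes the other way: Python objects -> JSON text.
round_trip = json.loads(json.dumps(record))
```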

### Nested JSON Data
Another type of JSON dataset is the nested JSON structure, in which the value of a key can itself contain more key/value pairs. Here is an example.

![nested_json](img/nested_json.png)

In the above dataset, article and blog are the two top-level keys, and the value of each is itself a set of key/value pairs.
Note that the above dataset is enclosed in quotes, i.e. it is a string.
A nested JSON can be read in several ways.
Here, you first use the `json.loads` function to parse the JSON string, then the `json_normalize` function to flatten the nested data into a table.
Since pandas 1.0, `json_normalize` is imported directly from `pandas`; in older versions it was imported from `pandas.io.json`.

`import urllib.request, json
import pandas as pd
from pandas import json_normalize`

`with urllib.request.urlopen("url_path") as url:
    nested_data = json.loads(url.read())
    data = json_normalize(nested_data,record_path ='key_name')`

The `urllib.request` module defines functions and classes that help with opening URLs (mostly HTTP), including basic and digest authentication, redirections, cookies and more.

`urllib.request.urlopen` opens a URL and returns a response object whose content you get with `.read()`.

`json.loads()` parses JSON text into the corresponding Python objects (dictionaries, lists, strings, numbers).
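
The flattening step can be tried without any URL. The sketch below assumes pandas 1.0 or later (where the function is available as `pd.json_normalize`) and uses a made-up nested document; the keys `articles` and `author` are hypothetical.

```python
import json

import pandas as pd

# Made-up nested document, inlined so that no URL is needed.
raw = """
{
  "source": "demo",
  "articles": [
    {"id": 1, "title": "First",  "author": {"name": "Ada",   "country": "UK"}},
    {"id": 2, "title": "Second", "author": {"name": "Linus", "country": "FI"}}
  ]
}
"""

nested_data = json.loads(raw)

# One row per element of "articles"; nested keys become dotted column names.
table = pd.json_normalize(nested_data, record_path="articles")
print(table)
```

The `record_path` argument names the key whose list should become the rows; the nested `author` dictionary is spread over the columns `author.name` and `author.country`.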

### Loading HTML Data
HTML (Hypertext Markup Language) is mainly used for creating web pages and web applications. It describes the structure of a web page semantically. The web browser receives an HTML document from a web server and renders it as a multimedia web page.

For web applications, HTML is used together with cascading style sheets (CSS) on the client side, while on the server side it is generated by web frameworks such as Flask or Django.

HTML uses tags to define each block of content: a `<p></p>` pair marks the start and end of a paragraph, an `<img>` tag adds an image to the page, and many other tags combine to form a complete HTML web page.

To read an HTML file, pandas looks for `<table></table>` tags, which define a table in HTML; the `<td></td>` tags inside a table mark its individual cells.
pandas uses `read_html()` to read the HTML document and returns a list with one dataframe per table found.

So, whenever you pass an HTML to pandas and expect it to output a nice looking dataframe, make sure the HTML page has a table in it!
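
Since `pd.read_html()` raises a `ValueError` when it finds no table, it can help to check a page first. The sketch below counts `<table>` tags with the standard `html.parser` module; the class name `TableCounter`, the function `count_tables` and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Count the <table> start tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag == "table":
            self.tables += 1


def count_tables(html_text):
    """Return how many <table> elements start in the given HTML text."""
    parser = TableCounter()
    parser.feed(html_text)
    return parser.tables


page = """
<html><body>
  <p>Prices</p>
  <table>
    <tr><th>Name</th><th>Ticker</th></tr>
    <tr><td>Bitcoin</td><td>BTC</td></tr>
  </table>
</body></html>
"""
print(count_tables(page))
print(count_tables("<p>No table here</p>"))
```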

Examples:


In [90]:
import urllib.request
response = urllib.request.urlopen('http://python.org/')
html = response.read()
html



In [91]:
df=pd.read_html(html)
df

[                                                   0
 0  Python provides convenience and flexibility fo...]

In [92]:
len(df), type(df)

(1, list)

In [93]:
df[0]

|   | 0 |
|---|---|
| 0 | Python provides convenience and flexibility fo... |


In the following example, you will use a cryptocurrency website as an HTML dataset. It lists various crypto coins with details about each coin, such as:
- Last price of the coin (`Last price`)
- Whether the coin price has increased or decreased, in percent (`%`)
- The volume traded over the last 24 hours (`24 volume`)
- Total number of coins (`# Coins`)

You will first import the `requests` library, which lets you send a request to the URL from which you want to fetch the HTML content.

In [94]:
import requests
url = 'https://www.worldcoinindex.com/'
crypto_url = requests.get(url)
crypto_url


<Response [200]>

So far, you defined the URL and used `requests.get()` to send a request to it. The response `<Response [200]>` carries the status code 200 (OK), which means that the web server answered the request successfully.

Now, to read the content of the HTML web page, all you need is `crypto_url.text`, which holds the HTML code of that cryptocurrency web page.

Finally, you pass `crypto_url.text` to the `pd.read_html()` function, which returns a list of dataframes, one for each table on the cryptocurrency web page.


In [95]:
crypto_url.text

'\r\n<!DOCTYPE html>\r\n<html style="background-color: #FFF">\r\n<head>\r\n\r\n    <meta http-equiv="Content-Type" content="text/html">\r\n    <meta name="mobile-web-app-capable" content="yes">\r\n    <meta charset="utf-8" />\r\n  \r\n    <meta name="description" content="Cryptocoins ranked by 24hr trading volume, price info, charts, market cap and news" />\r\n        <meta name="keywords" content="Coin, index, worldcoin, coinindex, worldindex, cryptocoins, crypto, traded, exchanges, ranked, marketcap, coins, cryptocurrency ,coinprices, price, last trade time, trade, market cap, trading, traded, volume, rate, buy cryptocurrency, sell cryptocurrency" />\r\n    <meta name="viewport" content="width=device-width, initial-scale=1, user-scalable = no">\r\n    <meta name="apple-mobile-web-app-capable" content="yes" />\r\n    <meta name="propeller" content="efad6eed15cdd45139ef34d287cc06ff" />\r\n    \r\n        <meta property="og:image" content="https://www.worldcoinindex.com/content/img/worl

In [96]:
crypto_data = pd.read_html(crypto_url.text)
crypto_data

[                                                     #  \
 0                                                    1   
 1                                                    2   
 2                                                    3   
 3                                                    4   
 4                                                    5   
 5                                                    6   
 6                                                    7   
 7                                                    8   
 8                                                    9   
 9                                                   10   
 10   googletag.cmd.push(function() { googletag.disp...   
 11                                                  11   
 12                                                  12   
 13                                                  13   
 14                                                  14   
 15                                                  15 

Let's print the length and the type of the result. The type should be a list.


In [97]:
len(crypto_data), type(crypto_data)


(1, list)

From the above output, it is clear that the result is a list containing only 1 table.


In [98]:
crypto_data = crypto_data[0]
crypto_data.head()

|   | # | Unnamed: 1 | Name | Ticker | Last price | % | 24 high | 24 low | Price Charts 7d | 24 volume | # Coins | Market cap |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 |  | Bitcoin | BTC | $ 11,557 | +1.61% | $ 11,579 | $ 11,233 |  | $ 4.76B | 18.51M | $ 213.98B |
| 1 | 2 |  | Ethereum | ETH | $ 387.41 | +3.32% | $ 388.35 | $ 366.72 |  | $ 4.13B | 112.97M | $ 43.76B |
| 2 | 3 |  | Chainlink | LINK | $ 11.44 | +5.73% | $ 11.57 | $ 10.65 |  | $ 686.65M | 350.00M | $ 4.00B |
| 3 | 4 |  | Litecoin | LTC | $ 51.18 | +1.49% | $ 51.35 | $ 49.51 |  | $ 601.82M | 65.65M | $ 3.36B |
| 4 | 5 |  | Eos | EOS | $ 2.67 | +0.48% | $ 2.68 | $ 2.62 |  | $ 564.78M | 1.01B | $ 2.72B |


Let's remove the first two columns, since they do not hold any useful information, and keep all the rows.

In [99]:
crypto_final = crypto_data.iloc[:,2:]
crypto_final.head()

|   | Name | Ticker | Last price | % | 24 high | 24 low | Price Charts 7d | 24 volume | # Coins | Market cap |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bitcoin | BTC | $ 11,557 | +1.61% | $ 11,579 | $ 11,233 |  | $ 4.76B | 18.51M | $ 213.98B |
| 1 | Ethereum | ETH | $ 387.41 | +3.32% | $ 388.35 | $ 366.72 |  | $ 4.13B | 112.97M | $ 43.76B |
| 2 | Chainlink | LINK | $ 11.44 | +5.73% | $ 11.57 | $ 10.65 |  | $ 686.65M | 350.00M | $ 4.00B |
| 3 | Litecoin | LTC | $ 51.18 | +1.49% | $ 51.35 | $ 49.51 |  | $ 601.82M | 65.65M | $ 3.36B |
| 4 | Eos | EOS | $ 2.67 | +0.48% | $ 2.68 | $ 2.62 |  | $ 564.78M | 1.01B | $ 2.72B |


## Documentation
If you want to use Twitter's API to collect your own tweets about a subject, the documents below are a good starting point.
You will need to create your own API keys with your own Twitter account.

- Tweepy: http://socialmedia-class.org/twittertutorial.html
- Python Twitter API : https://pypi.org/project/twitter/
- Incredibly cool: https://dev.twitter.com/overview/api/entities-in-twitter-objects
- Web data access: https://pandas-datareader.readthedocs.io/en/latest/remote_data.html




# In-Class Exercises
## Connection to a remote DataBase and Queries

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [2]:
import pymysql

In [3]:
server = "mysql1005.mochahost.com"#"10.126.8.140"
username = "hanansal"#"student1"
password = "b7ebl7@yetktir"#"student1"
## we have 2 others - emlyon2 and student2, and emlyon3 and student3

### Connect to the server to read from the database `movies`

In [4]:
connection = pymysql.connect(host=server,
                             user=username,
                             password=password,
                             db='hanansal_movies',
                             charset='utf8')


### Use the query `'SHOW tables'` to show the tables present in the movies database

In [5]:
pd.read_sql('SHOW tables', connection)

|   | Tables_in_hanansal_movies |
|---|---|
| 0 | country |
| 1 | department |
| 2 | gender |
| 3 | genre |
| 4 | keyword |
| 5 | language |
| 6 | language_role |
| 7 | movie |
| 8 | movie_cast |
| 9 | movie_company |


### Write a query to show the content of the table movie

In [6]:
pd.read_sql('Select * from movie', connection)

Unnamed: 0,movie_id,title,budget,homepage,overview,popularity,release_date,revenue,runtime,movie_status,tagline,vote_average,vote_count
0,5,Four Rooms,4000000,,It's Ted the Bellhop's first night on the job....,22.876230,1995-12-09,4300000,98,Released,Twelve outrageous guests. Four scandalous requ...,6.5,530
1,11,Star Wars,11000000,http://www.starwars.com/films/star-wars-episod...,Princess Leia is captured and held hostage by ...,126.393695,1977-05-25,775398007,121,Released,"A long time ago in a galaxy far, far away...",8.1,6624
2,12,Finding Nemo,94000000,http://movies.disney.com/finding-nemo,"Nemo, an adventurous young clownfish, is unexp...",85.688789,2003-05-30,940335536,100,Released,"There are 3.7 trillion fish in the ocean, they...",7.6,6122
3,13,Forrest Gump,55000000,,A man with a low IQ has accomplished great thi...,138.133331,1994-07-06,677945399,142,Released,"The world will never be the same, once you've ...",8.2,7927
4,14,American Beauty,15000000,http://www.dreamworks.com/ab/,"Lester Burnham, a depressed suburban father in...",80.878605,1999-09-15,356296601,122,Released,Look closer.,7.9,3313
5,16,Dancer in the Dark,12800000,,"Selma, a Czech immigrant on the verge of blind...",22.022228,2000-05-17,40031879,140,Released,You don't need eyes to see.,7.6,377
6,18,The Fifth Element,90000000,,"In 2257, a taxi driver is unintentionally give...",109.528572,1997-05-07,263920180,126,Released,There is no future without it.,7.3,3885
7,19,Metropolis,92620000,,In a futuristic city sharply divided between t...,32.351527,1927-01-10,650422,153,Released,There can be no understanding between the hand...,8.0,657
8,20,My Life Without Me,0,http://www.clubcultura.com/clubcine/clubcineas...,A Pedro Almodovar production in which a fatall...,7.958831,2003-03-07,9726954,106,Released,,7.2,77
9,22,Pirates of the Caribbean: The Curse of the Bla...,140000000,http://disney.go.com/disneyvideos/liveaction/p...,"Jack Sparrow, a freewheeling 17th-century pira...",271.972889,2003-07-09,655011224,143,Released,Prepare to be blown out of the water.,7.5,6985


### Write a query that selects all from the table genre 

In [39]:
pd.read_sql('Select * from genre', connection)

|   | genre_id | genre_name |
|---|---|---|
| 0 | 12 | Adventure |
| 1 | 14 | Fantasy |
| 2 | 16 | Animation |
| 3 | 18 | Drama |
| 4 | 27 | Horror |
| 5 | 28 | Action |
| 6 | 35 | Comedy |
| 7 | 36 | History |
| 8 | 37 | Western |
| 9 | 53 | Thriller |


### Write a query that selects all from the table movie_genres where the genre is Action.

In [69]:
action_movies = pd.read_sql("Select * from movie_genres where genre_id = 28",connection)
action_movies.head()


NameError: name 'connection' is not defined

### Write a query to show the first 10 rows of the table `person`

In [46]:
SQL = "Select * from person Limit 10"
persons = pd.read_sql(SQL, connection)
persons

## Twitter API


In [5]:
import json
import twitter
from twitter import Twitter, OAuth, TwitterHTTPError, TwitterStream

# Variables that contain the user credentials to access the Twitter API
# register for these at apps.twitter.com.
ACCESS_TOKEN = '485710952-SOTOradsfYJ2bQJBmLdD3y20SeQfTP5iFQZN3YEQ'
ACCESS_SECRET = 'gKjO7Ouqn2fUn3ZxV6DLf7jVWL1OiUi8s0DKrtIkWxWMV'
CONSUMER_KEY = 'z5qG7TjeuWZhzyUbrdqutNc1D'
CONSUMER_SECRET = 'lfCHuzAfhSLZccqrUkwR2Ud95PfCcXo4SwXUP23DSsRxnSJaga'



oauth = OAuth(ACCESS_TOKEN, ACCESS_SECRET, CONSUMER_KEY, CONSUMER_SECRET)



### Initiate the connection to Twitter Streaming API. Track `#bigdata` in english.

In [6]:
# Initiate the connection to Twitter Streaming API
twitter_stream = TwitterStream(auth=oauth)
iterator = twitter_stream.statuses.filter(track="#bigdata", language="en")

### Create an iterator to retrieve the data. You can set a limit to a small number.

In [7]:
# set a limit because you will be stopped at a certain number
tweet_count = 2  # low because it's just to show the data
mydata = []
for tweet in iterator:
    tweet_count -= 1
    # Twitter Python Tool wraps the data returned by Twitter 
    # as a TwitterDictResponse object.
    mydata.append(tweet)
    if tweet_count <= 0:
        break 
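
The same limit can also be written with `itertools.islice`, which stops reading after a fixed number of items. In the sketch below, `fake_stream` is a hypothetical stand-in for the Twitter iterator so that the example runs without credentials.

```python
from itertools import islice


def fake_stream():
    """Endless stand-in for the Twitter stream iterator (made-up data)."""
    n = 0
    while True:
        n += 1
        yield {"id": n, "text": f"tweet number {n}"}


# Take at most 2 items, then stop reading from the stream.
first_two = list(islice(fake_stream(), 2))
print([tweet["id"] for tweet in first_two])
```

With the real stream, the call would look like `list(islice(iterator, 2))`.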

In [8]:
mydata

[{'created_at': 'Mon Dec 07 15:11:24 +0000 2020',
  'id': 1335965139519086596,
  'id_str': '1335965139519086596',
  'text': 'RT @mvollmer1: How far can your personal data go?\n#IoT and #ArtificialIntelligence are continually evolving, so what will happen with #BigD…',
  'source': '<a href="https://homeinc.com" rel="nofollow">forRetweeting</a>',
  'truncated': False,
  'in_reply_to_status_id': None,
  'in_reply_to_status_id_str': None,
  'in_reply_to_user_id': None,
  'in_reply_to_user_id_str': None,
  'in_reply_to_screen_name': None,
  'user': {'id': 1142424032794406912,
   'id_str': '1142424032794406912',
   'name': 'Cyber Security News',
   'screen_name': 'CyberSecurityN8',
   'location': None,
   'url': None,
   'description': 'The place for InfoSec, CyberSecurity, DevSecOps, DataSecurity and many more!!! Stay tuned.',
   'translator_type': 'none',
   'protected': False,
   'verified': False,
   'followers_count': 8966,
   'friends_count': 1,
   'listed_count': 117,
   'favourites_c

### Show the keys of the first item

In [56]:
mydata[0].keys()

dict_keys(['created_at', 'id', 'id_str', 'text', 'source', 'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'retweeted_status', 'is_quote_status', 'quote_count', 'reply_count', 'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted', 'filter_level', 'lang', 'timestamp_ms'])

### Show the tweet of the first item

In [57]:
mydata[0]['text']

'RT @Ronald_vanLoon: Introducing Firebase #MachineLearning\nby @Firebase\n\n#BigData #ArtificialIntelligence #ML #AI #DataScience #DeepLearning…'

### Show the entities 

In [58]:
mydata[0]['entities']

{'hashtags': [{'text': 'MachineLearning', 'indices': [41, 57]},
  {'text': 'BigData', 'indices': [72, 80]},
  {'text': 'ArtificialIntelligence', 'indices': [81, 104]},
  {'text': 'ML', 'indices': [105, 108]},
  {'text': 'AI', 'indices': [109, 112]},
  {'text': 'DataScience', 'indices': [113, 125]},
  {'text': 'DeepLearning', 'indices': [126, 139]}],
 'urls': [],
 'user_mentions': [{'screen_name': 'Ronald_vanLoon',
   'name': 'Ronald van Loon',
   'id': 555031989,
   'id_str': '555031989',
   'indices': [3, 18]},
  {'screen_name': 'Firebase',
   'name': 'Firebase',
   'id': 447644824,
   'id_str': '447644824',
   'indices': [61, 70]}],
 'symbols': []}

### Show the hashtags from the entities

In [59]:
mydata[0]['entities']['hashtags']

[{'text': 'MachineLearning', 'indices': [41, 57]},
 {'text': 'BigData', 'indices': [72, 80]},
 {'text': 'ArtificialIntelligence', 'indices': [81, 104]},
 {'text': 'ML', 'indices': [105, 108]},
 {'text': 'AI', 'indices': [109, 112]},
 {'text': 'DataScience', 'indices': [113, 125]},
 {'text': 'DeepLearning', 'indices': [126, 139]}]

### Convert the data you fetched into a dataframe

In [60]:
import pandas as pd
pd.DataFrame(mydata)

Unnamed: 0,contributors,coordinates,created_at,entities,favorite_count,favorited,filter_level,geo,id,id_str,...,quote_count,reply_count,retweet_count,retweeted,retweeted_status,source,text,timestamp_ms,truncated,user
0,,,Mon Oct 12 12:39:21 +0000 2020,"{'hashtags': [{'text': 'MachineLearning', 'ind...",0,False,low,,1315633150924656640,1315633150924656640,...,0,0,0,False,{'created_at': 'Sat Oct 10 00:58:55 +0000 2020...,"<a href=""http://twitter.com/download/android"" ...",RT @Ronald_vanLoon: Introducing Firebase #Mach...,1602506361119,False,"{'id': 890844439442325504, 'id_str': '89084443..."
1,,,Mon Oct 12 12:39:24 +0000 2020,"{'hashtags': [{'text': 'MachineLearning', 'ind...",0,False,low,,1315633164870660096,1315633164870660096,...,0,0,0,False,{'created_at': 'Sat Oct 10 00:21:15 +0000 2020...,"<a href=""http://twitter.com/download/android"" ...",RT @Ronald_vanLoon: Recolorizing Overwatch Foo...,1602506364444,False,"{'id': 890844439442325504, 'id_str': '89084443..."


### Search for the 10 most recent tweets with "#nlp" 

In [61]:
twitter = Twitter(auth=oauth)
nlptweets = twitter.search.tweets(q='#nlp', result_type='recent', lang='en', count=10)

In [62]:
nlptweets

{'statuses': [{'created_at': 'Mon Oct 12 12:40:23 +0000 2020',
   'id': 1315633414402306048,
   'id_str': '1315633414402306048',
   'text': 'RT @stpiindia: #ComputerVision &amp; #NLP can accelerate remote diagnosis of patients &amp; rev up treatment process by analysing symptoms &amp; vital…',
   'truncated': False,
   'entities': {'hashtags': [{'text': 'ComputerVision', 'indices': [15, 30]},
     {'text': 'NLP', 'indices': [37, 41]}],
    'symbols': [],
    'user_mentions': [{'screen_name': 'stpiindia',
      'name': 'STPI',
      'id': 2713937910,
      'id_str': '2713937910',
      'indices': [3, 13]}],
    'urls': []},
   'metadata': {'iso_language_code': 'en', 'result_type': 'recent'},
   'source': '<a href="https://mobile.twitter.com" rel="nofollow">Twitter Web App</a>',
   'in_reply_to_status_id': None,
   'in_reply_to_status_id_str': None,
   'in_reply_to_user_id': None,
   'in_reply_to_user_id_str': None,
   'in_reply_to_screen_name': None,
   'user': {'id': 727396612167864320

### Check for the keys

In [63]:
# this query returns a dictionary.  You need to pull out the tweets themselves from 'statuses'
nlptweets.keys()

dict_keys(['statuses', 'search_metadata'])

### Print the tweets you fetched by iterating on the statuses 

In [64]:
for tweet in nlptweets['statuses']:
    print(tweet['text'])

RT @stpiindia: #ComputerVision &amp; #NLP can accelerate remote diagnosis of patients &amp; rev up treatment process by analysing symptoms &amp; vital…
ROI concerns helm the rising rage. However, talent shortages still persist. So, if you intend to implement AI in yo… https://t.co/09em17fqQT
RT @bebrighterlife: I'm just gettin into blogging, thanks to @NatWriterForYou and as workplace anxiety is a common issue i see incidents, i…
RT @DreadBong0: "Data .World have around 800,000 #dataset consumers"

✅ Data .World
✅ Citi
✅ UBS
✅ Trading Desks
✅ Universities

"We are pu…
RT @IainLJBrown: Transforming the telecom industry with Artificial Intelligence - https://t.co/B688nowXKb

Read more here: https://t.co/5kV…
RT @IainLJBrown: 10 Essential Leadership Qualities For The Age Of Artificial Intelligence - Forbes

Read more here: https://t.co/48QcuLi4j8…
RT @IainLJBrown: UK and US enter special deal to develop Artificial Intelligence and thwart China - Express

Read more here: https://t.co/X…
RT

### Convert the statuses to dataframe

In [65]:
import pandas as pd
df = pd.DataFrame(nlptweets['statuses'])
df.head()

Unnamed: 0,contributors,coordinates,created_at,entities,favorite_count,favorited,geo,id,id_str,in_reply_to_screen_name,...,quoted_status,quoted_status_id,quoted_status_id_str,retweet_count,retweeted,retweeted_status,source,text,truncated,user
0,,,Mon Oct 12 12:40:23 +0000 2020,"{'hashtags': [{'text': 'ComputerVision', 'indi...",0,False,,1315633414402306048,1315633414402306048,,...,,,,99,False,{'created_at': 'Mon Oct 12 06:21:33 +0000 2020...,"<a href=""https://mobile.twitter.com"" rel=""nofo...",RT @stpiindia: #ComputerVision &amp; #NLP can ...,False,"{'id': 727396612167864320, 'id_str': '72739661..."
1,,,Mon Oct 12 12:39:37 +0000 2020,"{'hashtags': [], 'symbols': [], 'user_mentions...",0,False,,1315633221690781696,1315633221690781696,,...,,,,0,False,,"<a href=""https://mobile.twitter.com"" rel=""nofo...","ROI concerns helm the rising rage. However, ta...",True,"{'id': 781059970531991552, 'id_str': '78105997..."
2,,,Mon Oct 12 12:38:59 +0000 2020,"{'hashtags': [], 'symbols': [], 'user_mentions...",0,False,,1315633058842906624,1315633058842906624,,...,,,,1,False,{'created_at': 'Mon Oct 12 11:54:37 +0000 2020...,"<a href=""http://twitter.com/download/iphone"" r...",RT @bebrighterlife: I'm just gettin into blogg...,False,"{'id': 937138441, 'id_str': '937138441', 'name..."
3,,,Mon Oct 12 12:38:41 +0000 2020,"{'hashtags': [{'text': 'dataset', 'indices': [...",0,False,,1315632986042232836,1315632986042232836,,...,,,,2,False,{'created_at': 'Mon Oct 12 12:12:08 +0000 2020...,"<a href=""http://twitter.com/download/iphone"" r...","RT @DreadBong0: ""Data .World have around 800,0...",False,"{'id': 750183931018776576, 'id_str': '75018393..."
4,,,Mon Oct 12 12:36:42 +0000 2020,"{'hashtags': [], 'symbols': [], 'user_mentions...",0,False,,1315632486395846656,1315632486395846656,,...,,,,10,False,{'created_at': 'Mon Oct 12 03:53:39 +0000 2020...,"<a href=""http://twitter.com/download/iphone"" r...",RT @IainLJBrown: Transforming the telecom indu...,False,"{'id': 76115291, 'id_str': '76115291', 'name':..."


Ideally, in a good analysis, you would extract items from entities too.  Think about functions to apply to make new columns, or a new dataframe using the same tweet id as index and the entities as columns.
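
One possible way to do this, sketched under assumptions: flatten each tweet into a plain dictionary before building the dataframe. The function name `tweet_to_row`, the choice of columns and the sample statuses are hypothetical; the sample data is made up but follows the layout of the search results above.

```python
# Made-up, trimmed-down statuses with the same layout as the search results above.
statuses = [
    {
        "id": 101,
        "text": "Reading about #NLP and #ComputerVision with @some_user",
        "user": {"screen_name": "alice"},
        "entities": {
            "hashtags": [{"text": "NLP"}, {"text": "ComputerVision"}],
            "user_mentions": [{"screen_name": "some_user"}],
        },
    },
    {
        "id": 102,
        "text": "No hashtags in this one",
        "user": {"screen_name": "bob"},
        "entities": {"hashtags": [], "user_mentions": []},
    },
]


def tweet_to_row(tweet):
    """Flatten one tweet dict into a flat dict of plain values."""
    entities = tweet.get("entities", {})
    return {
        "id": tweet["id"],
        "text": tweet["text"],
        "user": tweet.get("user", {}).get("screen_name"),
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
    }


rows = [tweet_to_row(tweet) for tweet in statuses]
print(rows[0])
```

`pd.DataFrame(rows).set_index("id")` then gives one row per tweet, indexed by tweet id, with the entities as their own columns.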

## Reading from Public Data APIs

In [66]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


In [67]:
import pandas_datareader.data as web
import datetime

In [68]:
type(web)

module

### Read the Apple Inc. data from Yahoo Finance. Set a start and end date as you like (use `datetime.datetime()`)
 https://fr.finance.yahoo.com/quote/AAPL/

In [69]:
start = datetime.datetime(2010, 1, 1)
end = datetime.datetime(2013, 1, 27)

f = web.DataReader("AAPL", 'yahoo', start, end)#



In [70]:
f.head()

| Date | High | Low | Open | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 2010-01-04 | 7.660714 | 7.585 | 7.6225 | 7.643214 | 493729600.0 | 6.604801 |
| 2010-01-05 | 7.699643 | 7.616071 | 7.664286 | 7.656428 | 601904800.0 | 6.616219 |
| 2010-01-06 | 7.686786 | 7.526786 | 7.656428 | 7.534643 | 552160000.0 | 6.51098 |
| 2010-01-07 | 7.571429 | 7.466072 | 7.5625 | 7.520714 | 477131200.0 | 6.498945 |
| 2010-01-08 | 7.571429 | 7.466429 | 7.510714 | 7.570714 | 447610800.0 | 6.54215 |


### What is the stock data for the date `2010-01-04`
Hint: use `.loc[]`

In [71]:
f.loc['2010-01-04']

High         7.660714e+00
Low          7.585000e+00
Open         7.622500e+00
Close        7.643214e+00
Volume       4.937296e+08
Adj Close    6.604801e+00
Name: 2010-01-04 00:00:00, dtype: float64

Another example: read the `tran_sf_railac` dataset from Eurostat:
http://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=tran_sf_railac&lang=en

In [72]:
df = web.DataReader("tran_sf_railac", 'eurostat')


In [73]:
df

ACCIDENT,"Collisions of trains, including collisions with obstacles within the clearance gauge","Collisions of trains, including collisions with obstacles within the clearance gauge","Collisions of trains, including collisions with obstacles within the clearance gauge","Collisions of trains, including collisions with obstacles within the clearance gauge","Collisions of trains, including collisions with obstacles within the clearance gauge","Collisions of trains, including collisions with obstacles within the clearance gauge","Collisions of trains, including collisions with obstacles within the clearance gauge","Collisions of trains, including collisions with obstacles within the clearance gauge","Collisions of trains, including collisions with obstacles within the clearance gauge","Collisions of trains, including collisions with obstacles within the clearance gauge",...,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown
UNIT,Number,Number,Number,Number,Number,Number,Number,Number,Number,Number,...,Number,Number,Number,Number,Number,Number,Number,Number,Number,Number
GEO,Austria,Belgium,Bulgaria,Switzerland,Channel Tunnel,Czechia,Germany (until 1990 former territory of the FRG),Denmark,Estonia,Greece,...,Netherlands,Norway,Poland,Portugal,Romania,Sweden,Slovenia,Slovakia,Turkey,United Kingdom
FREQ,Annual,Annual,Annual,Annual,Annual,Annual,Annual,Annual,Annual,Annual,...,Annual,Annual,Annual,Annual,Annual,Annual,Annual,Annual,Annual,Annual
TIME_PERIOD,Unnamed: 1_level_4,Unnamed: 2_level_4,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4,Unnamed: 8_level_4,Unnamed: 9_level_4,Unnamed: 10_level_4,Unnamed: 11_level_4,Unnamed: 12_level_4,Unnamed: 13_level_4,Unnamed: 14_level_4,Unnamed: 15_level_4,Unnamed: 16_level_4,Unnamed: 17_level_4,Unnamed: 18_level_4,Unnamed: 19_level_4,Unnamed: 20_level_4,Unnamed: 21_level_4
2016-01-01,7.0,2.0,3.0,2.0,0.0,6.0,29.0,3.0,3.0,1.0,...,,,,,,,,,0.0,
2017-01-01,7.0,1.0,1.0,3.0,0.0,11.0,38.0,2.0,4.0,1.0,...,,,,,,,,,0.0,
2018-01-01,4.0,0.0,1.0,1.0,0.0,6.0,40.0,1.0,0.0,2.0,...,,,,,,,,,0.0,


In [74]:
df.columns

MultiIndex(levels=[['Accidents to persons caused by rolling stock in motion', 'Collisions of trains, including collisions with obstacles within the clearance gauge', 'Derailments of trains', 'Fires in rolling stock', 'Level crossing accidents', 'Others', 'Total', 'Unknown'], ['Number'], ['Austria', 'Belgium', 'Bulgaria', 'Channel Tunnel', 'Croatia', 'Czechia', 'Denmark', 'Estonia', 'European Union - 27 countries (from 2020)', 'European Union - 28 countries (2013-2020)', 'Finland', 'France', 'Germany (until 1990 former territory of the FRG)', 'Greece', 'Hungary', 'Ireland', 'Italy', 'Latvia', 'Lithuania', 'Luxembourg', 'Montenegro', 'Netherlands', 'North Macedonia', 'Norway', 'Poland', 'Portugal', 'Romania', 'Slovakia', 'Slovenia', 'Spain', 'Sweden', 'Switzerland', 'Turkey', 'United Kingdom'], ['Annual']],
           codes=[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
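
The column index printed above has four levels, which the table output labels `ACCIDENT`, `UNIT`, `GEO` and `FREQ`. As a minimal sketch of how to work with those levels, the snippet below rebuilds a two-column excerpt by hand (the values are copied from the output of `df`; the name `small` is only illustrative) and selects one country with `xs`.

```python
import pandas as pd

ACCIDENT = ("Collisions of trains, including collisions with obstacles "
            "within the clearance gauge")

# Hand-built excerpt of the Eurostat frame: two countries, three years
columns = pd.MultiIndex.from_tuples(
    [(ACCIDENT, "Number", "Austria", "Annual"),
     (ACCIDENT, "Number", "Belgium", "Annual")],
    names=["ACCIDENT", "UNIT", "GEO", "FREQ"],
)
index = pd.DatetimeIndex(["2016-01-01", "2017-01-01", "2018-01-01"],
                         name="TIME_PERIOD")
small = pd.DataFrame([[7.0, 2.0], [7.0, 1.0], [4.0, 0.0]],
                     index=index, columns=columns)

# List the values stored in one level of the column index
geo_values = list(small.columns.get_level_values("GEO"))

# Keep only the columns whose GEO level equals "Austria"
austria = small.xs("Austria", axis=1, level="GEO")

print(geo_values)
print(austria)
```

The same two calls should work on the full `df`, assuming its column levels carry the same names.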

Sometimes the transposed table is easier to read, but un-nesting the column levels would be a good challenge :)

In [75]:
# Transpose so that each (ACCIDENT, UNIT, GEO, FREQ) series becomes a row
df.transpose()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,TIME_PERIOD,2016-01-01 00:00:00,2017-01-01 00:00:00,2018-01-01 00:00:00
ACCIDENT,UNIT,GEO,FREQ,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Collisions of trains, including collisions with obstacles within the clearance gauge",Number,Austria,Annual,7.0,7.0,4.0
"Collisions of trains, including collisions with obstacles within the clearance gauge",Number,Belgium,Annual,2.0,1.0,0.0
"Collisions of trains, including collisions with obstacles within the clearance gauge",Number,Bulgaria,Annual,3.0,1.0,1.0
"Collisions of trains, including collisions with obstacles within the clearance gauge",Number,Switzerland,Annual,2.0,3.0,1.0
"Collisions of trains, including collisions with obstacles within the clearance gauge",Number,Channel Tunnel,Annual,0.0,0.0,0.0
"Collisions of trains, including collisions with obstacles within the clearance gauge",Number,Czechia,Annual,6.0,11.0,6.0
"Collisions of trains, including collisions with obstacles within the clearance gauge",Number,Germany (until 1990 former territory of the FRG),Annual,29.0,38.0,40.0
"Collisions of trains, including collisions with obstacles within the clearance gauge",Number,Denmark,Annual,3.0,2.0,1.0
"Collisions of trains, including collisions with obstacles within the clearance gauge",Number,Estonia,Annual,3.0,4.0,0.0
"Collisions of trains, including collisions with obstacles within the clearance gauge",Number,Greece,Annual,1.0,1.0,2.0
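
One possible answer to the un-nesting challenge is to turn every column level into an ordinary column, giving one row per (series, date) pair. The sketch below does this on a hand-built two-column excerpt (values copied from the output above; `small` and `long_df` are illustrative names). It relies on `DataFrame.unstack()`, which returns a Series indexed by the column levels plus the row index when the row index is not a MultiIndex.

```python
import pandas as pd

ACCIDENT = ("Collisions of trains, including collisions with obstacles "
            "within the clearance gauge")

# Hand-built excerpt of the Eurostat frame: two countries, three years
columns = pd.MultiIndex.from_tuples(
    [(ACCIDENT, "Number", "Austria", "Annual"),
     (ACCIDENT, "Number", "Belgium", "Annual")],
    names=["ACCIDENT", "UNIT", "GEO", "FREQ"],
)
index = pd.DatetimeIndex(["2016-01-01", "2017-01-01", "2018-01-01"],
                         name="TIME_PERIOD")
small = pd.DataFrame([[7.0, 2.0], [7.0, 1.0], [4.0, 0.0]],
                     index=index, columns=columns)

# unstack() moves the dates next to the four column levels in one index;
# reset_index() then turns every index level into a regular column
long_df = small.unstack().rename("value").reset_index()

print(long_df)
```

The long layout is usually easier to filter and group than the nested one, for example with `long_df[long_df["GEO"] == "Austria"]`.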


## Import data from a URL 
### Import the data from the following url: 
https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv




In [67]:
df = pd.read_csv('https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv') 

df.head()

Unnamed: 0,Country,Region
0,Algeria,AFRICA
1,Angola,AFRICA
2,Benin,AFRICA
3,Botswana,AFRICA
4,Burkina,AFRICA


### Import data from this url: 
https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

In [68]:
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None) 
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735
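
The `header=None` argument matters here because the wine file has no header row. A minimal sketch of the difference, using an in-memory excerpt (the first four fields of the first two rows shown above) instead of the remote file:

```python
import io
import pandas as pd

# First four fields of the first two rows shown above
raw = "1,14.23,1.71,2.43\n1,13.2,1.78,2.14\n"

# Default behaviour: the first line is consumed as column names
with_header = pd.read_csv(io.StringIO(raw))

# header=None: every line is data and columns are numbered 0, 1, 2, ...
no_header = pd.read_csv(io.StringIO(raw), header=None)

print(with_header.shape)
print(no_header.shape)
```

Without `header=None` the first observation would silently disappear into the column names.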


### Import data from this url 

https://raw.githubusercontent.com/chrisalbon/simulated_datasets/master/data.json

Note that the data is in JSON format, organized as a key-value dictionary with three keys: `integer`, `datetime`, and `category`.
Pass the URL to `pd.read_json()`, which returns a dataframe. The keys of the JSON become the columns of the dataframe, and the values become its rows.

In [66]:
df = pd.read_json('https://raw.githubusercontent.com/chrisalbon/simulated_datasets/master/data.json')
df.head()

Unnamed: 0,integer,datetime,category
0,5,2015-01-01 00:00:00,0
1,5,2015-01-01 00:00:01,0
10,5,2015-01-01 00:00:10,0
11,5,2015-01-01 00:00:11,0
12,8,2015-01-01 00:00:12,0
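
To see how keys map to columns without a network connection, here is a small sketch that reads an in-memory JSON string with the same three keys. The layout (each key mapping row labels to values) is an assumption about how the remote file is organized, and `frame` is an illustrative name.

```python
import io
import pandas as pd

# Assumed column-oriented layout: {column name: {row label: value}}
raw = """
{
  "integer":  {"0": 5, "1": 5},
  "datetime": {"0": "2015-01-01 00:00:00", "1": "2015-01-01 00:00:01"},
  "category": {"0": 0, "1": 0}
}
"""

# Wrap the string in StringIO; read_json then treats it like a file
frame = pd.read_json(io.StringIO(raw))

print(frame)
```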


### Import data from the following url containing data in nested json format.
Copy <a href="https://newsapi.org/v2/everything?q=timesofindia&from=2020-10-1&sortBy=publishedAt&apiKey=46f1d0835e7a48e3bb9859008515ba6f">this link</a>. It returns news articles in JSON format. Notice that the key "articles" holds a list of records, each with its own keys and values. Read the content of "articles" into a dataframe.



In [1]:
import urllib.request, json
import pandas as pd
with urllib.request.urlopen("https://newsapi.org/v2/everything?q=timesofindia&from=2020-12-1&sortBy=publishedAt&apiKey=46f1d0835e7a48e3bb9859008515ba6f") as url:
    data = json.loads(url.read())
data

{'status': 'ok',
 'totalResults': 2,
 'articles': [{'source': {'id': None, 'name': 'Vnexpress.net'},
   'author': 'VnExpress',
   'title': 'Chồng thử nghiệm vaccine Covid-19 sau cái chết của vợ',
   'description': 'Nigel Demaline luôn nghĩ rằng mình sẽ chết trước Pauline bởi bà rất khỏe mạnh. Không ngờ Covid-19 cướp đi vợ và đưa chồng vào cuộc thử nghiệm vaccine.',
   'url': 'https://vnexpress.net/chong-thu-nghiem-vaccine-covid-19-sau-cai-chet-cua-vo-4201989.html',
   'urlToImage': 'https://vcdn-suckhoe.vnecdn.net/2020/12/05/vaccinecovid19-1607162729-4439-1607162793.jpg?w=0&h=0&q=100&dpr=1&fit=crop&s=gSIsn_C0JK7o4qdZJoWrxQ',
   'publishedAt': '2020-12-06T00:12:00Z',
   'content': 'AnhNigel Demaline luôn ngh rng mình s cht trc Pauline bi bà rt khe mnh. Không ng Covid-19 cp i v và a chng vào cuc th nghim vaccine.Nigel chia s câu chuyn khin ông tr thành tình nguyn viên tham gia t… [+4397 chars]'},
  {'source': {'id': 'the-times-of-india', 'name': 'The Times of India'},
   'author': 'PTI',

In [2]:
from pandas import json_normalize  # top-level import; pandas.io.json.json_normalize is deprecated since pandas 1.0

nested_full = json_normalize(data)
nested_full

Unnamed: 0,articles,status,totalResults
0,"[{'source': {'id': None, 'name': 'Vnexpress.ne...",ok,2


In [3]:
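from pandas import json_normalize  # top-level import (pandas >= 1.0)

# record_path selects the list stored under "articles", giving one row per article
articles = json_normalize(data, record_path='articles')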
articles

Unnamed: 0,author,content,description,publishedAt,source,title,url,urlToImage
0,VnExpress,AnhNigel Demaline luôn ngh rng mình s cht trc ...,Nigel Demaline luôn nghĩ rằng mình sẽ chết trư...,2020-12-06T00:12:00Z,"{'id': None, 'name': 'Vnexpress.net'}",Chồng thử nghiệm vaccine Covid-19 sau cái chết...,https://vnexpress.net/chong-thu-nghiem-vaccine...,https://vcdn-suckhoe.vnecdn.net/2020/12/05/vac...
1,PTI,The Delhi Sikh Gurdwara Management Committee (...,The Delhi Sikh Gurdwara Management Committee (...,2020-12-04T03:49:43Z,"{'id': 'the-times-of-india', 'name': 'The Time...",DSGMC sends legal notice to Kangana,https://timesofindia.indiatimes.com/entertainm...,"https://static.toiimg.com/thumb/msid-79559104,..."
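
In the output above the `source` column still holds a dictionary. As a sketch of one way to flatten it, the snippet below passes the list of articles directly to `pd.json_normalize` (available from pandas 1.0), which expands nested dictionaries into dotted column names. The `payload` dictionary is a hand-made stand-in for the API response, keeping only a few of the fields shown above.

```python
import pandas as pd

# Hand-made stand-in for the API response (subset of the fields shown above)
payload = {
    "status": "ok",
    "totalResults": 2,
    "articles": [
        {"source": {"id": None, "name": "Vnexpress.net"},
         "author": "VnExpress",
         "publishedAt": "2020-12-06T00:12:00Z"},
        {"source": {"id": "the-times-of-india", "name": "The Times of India"},
         "author": "PTI",
         "publishedAt": "2020-12-04T03:49:43Z"},
    ],
}

# Nested dictionaries become columns such as "source.id" and "source.name"
flat = pd.json_normalize(payload["articles"])

# meta copies top-level keys onto every article row
with_meta = pd.json_normalize(payload, record_path="articles",
                              meta=["status", "totalResults"])

print(flat.columns.tolist())
print(with_meta.columns.tolist())
```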


Another method, using the `requests` library to fetch and decode the JSON in one step:

In [28]:
import requests
from pandas import json_normalize  # top-level import (pandas >= 1.0)

url = 'https://newsapi.org/v2/everything?q=timesofindia&from=2020-10-1&sortBy=publishedAt&apiKey=46f1d0835e7a48e3bb9859008515ba6f'

r_json = requests.get(url).json()
r_json


{'status': 'ok',
 'totalResults': 8,
 'articles': [{'source': {'id': None, 'name': 'Detik.com'},
   'author': 'Defara Millenia Romadhona',
   'title': '4 Ciri Kecanduan Masturbasi, Pernah Mengalami Salah Satunya?',
   'description': 'Meski dianggap punya manfaat kesehatan, masturbasi bisa menghadirkan berbagai masalah jika dilakukan berlebihan. Apalagi sampai ada yang kecanduan.',
   'url': 'https://health.detik.com/sexual-health/d-5207191/4-ciri-kecanduan-masturbasi-pernah-mengalami-salah-satunya',
   'urlToImage': 'https://awsimages.detik.net.id/api/wm/2017/05/16/2116d494-b740-4da3-b227-75d860d53199_169.jpg?wid=54&w=650&v=1&t=jpeg',
   'publishedAt': '2020-10-09T11:30:47Z',
   'content': 'Jakarta - Meski dianggap punya manfaat kesehatan, masturbasi bisa menghadirkan berbagai masalah terutama jika dilakukan berlebihan. Tidak hanya dampak secara fisik, melainkan juga psikis.\r\nDampak fis… [+1407 chars]'},
  {'source': {'id': None, 'name': 'Thefrisky.com'},
   'author': 'Wendy Stokes',
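
The request URL above packs four query parameters (`q`, `from`, `sortBy`, `apiKey`) into one long string. A minimal sketch of building it from a dictionary instead, so that each parameter can be changed on its own. The helper `build_newsapi_url` and the placeholder key `YOUR_API_KEY` are hypothetical names introduced for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://newsapi.org/v2/everything"


def build_newsapi_url(query, from_date, api_key, sort_by="publishedAt"):
    """Assemble the request URL from its query parameters (hypothetical helper)."""
    params = {
        "q": query,
        "from": from_date,
        "sortBy": sort_by,
        "apiKey": api_key,
    }
    # urlencode joins the pairs with "&" and escapes special characters
    return f"{BASE_URL}?{urlencode(params)}"


url = build_newsapi_url("timesofindia", "2020-10-1", "YOUR_API_KEY")
print(url)
```

The resulting string can be passed to `requests.get(url)` exactly like the hand-written URL.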

`json_normalize`: normalizes semi-structured JSON data into a flat table.



In [64]:
df = json_normalize(r_json,record_path ='articles')
df


Unnamed: 0,author,content,description,publishedAt,source,title,url,urlToImage
0,Defara Millenia Romadhona,Jakarta - Meski dianggap punya manfaat kesehat...,"Meski dianggap punya manfaat kesehatan, mastur...",2020-10-09T11:30:47Z,"{'id': None, 'name': 'Detik.com'}","4 Ciri Kecanduan Masturbasi, Pernah Mengalami ...",https://health.detik.com/sexual-health/d-52071...,https://awsimages.detik.net.id/api/wm/2017/05/...
1,Wendy Stokes,One of the best ways to relax the brain and de...,One of the best ways to relax the brain and de...,2020-10-06T13:54:17Z,"{'id': None, 'name': 'Thefrisky.com'}",These New York Times Best Sellers Are the Most...,https://thefrisky.com/these-new-york-times-bes...,https://thefrisky.com/wp-content/uploads/2020/...
2,Odilia WS,Jakarta - Warna-warni cantik alat masak sangat...,Warna-warni cantik alat masak sangat menggoda....,2020-10-06T10:38:40Z,"{'id': None, 'name': 'Detik.com'}",Apakah Alat Masak Silikon Aman Bagi Kesehatan?,https://food.detik.com/info-kuliner/d-5201759/...,https://awsimages.detik.net.id/api/wm/2020/10/...
3,Rohan Venkataramakrishnan,Welcome to The Political Fix by Rohan Venkatar...,<ol><li>The Political Fix: Five lenses through...,2020-10-05T04:49:00Z,"{'id': None, 'name': 'Scroll.in'}",The Political Fix: Five lenses through which t...,https://scroll.in/article/974923/the-political...,https://s01.sgp1.cdn.digitaloceanspaces.com/bo...
4,خبرگزاری جمهوری اسلامی |صفحه اصلی | IRNA News ...,""" "" : "" . : "" "" . . . . : "" "" . \r\n . . . . ...",دهلی نو-ایرنا- هارش شارینگلا معاون وزیر امور خ...,2020-10-04T10:42:40Z,"{'id': None, 'name': 'Irna.ir'}",تاکید هند بر عدم استفاده اول از سلاح اتمی,https://www.irna.ir/news/84064355/تاکید-هند-بر...,https://img9.irna.ir/d/r2/2020/10/04/4/1576552...
5,PTI,"If not this time, then we will try again: Cong...",A delegation of Congress MPs led by former par...,2020-10-03T10:01:07Z,"{'id': 'the-times-of-india', 'name': 'The Time...","Rahul, Priyanka head to Hathras again to meet ...",https://timesofindia.indiatimes.com/india/rahu...,"https://static.toiimg.com/thumb/msid-78461833,..."
6,,"According to Live Law, the court has also dire...",<ol><li>Hathras gangrape: Allahabad HC takes s...,2020-10-01T16:43:34Z,"{'id': None, 'name': 'Firstpost'}",Hathras gangrape: Allahabad HC takes suo moto ...,https://www.firstpost.com/india/hathras-gangra...,https://images.firstpost.com/wp-content/upload...
7,Defara Millenia Romadhona,Jakarta - Maag disebabkan tingginya kadar asam...,Asam lambung dapat dihindari dengan mengatur p...,2020-10-01T01:35:23Z,"{'id': None, 'name': 'Detik.com'}",4 Menu Sarapan yang Picu Asam Lambung Naik,https://health.detik.com/berita-detikhealth/d-...,https://awsimages.detik.net.id/api/wm/2019/05/...
