In [1]:
import requests, json

github_user = "shars95"
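# NOTE: /users/{user}/ML_Projects is not a GitHub REST API route, so this request
# returns 404 "Not Found" (see the next cell). A single repository lives at
# /repos/{owner}/{repo}; a user's repository list lives at /users/{user}/repos.
endpoint = f"https://api.github.com/users/{github_user}/ML_Projects"

repos = requests.get(endpoint).json()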

In [2]:
repos

{'message': 'Not Found',
 'documentation_url': 'https://developer.github.com/v3'}

In [71]:
import os,cv2

In [12]:
os.chdir(r"C:/Users/shars/OneDrive/Desktop/Python_workspace/dataset")

In [13]:
import csv
with open("a.txt", newline="") as f:  # newline="" is what the csv module expects
    lines = list(csv.reader(f))

In [14]:
header,values=lines[0],lines[1:]

In [15]:
header

['a', 'b', 'c']

In [16]:
data_dict = {h: v for h, v in zip(header, zip(*values))}

In [17]:
data_dict

{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}
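
The `zip(*values)` call above transposes the list of rows into one tuple per column, which is why each header ends up mapped to all the values in its column. As a minimal sketch, the snippet below repeats that step on an in-memory string and contrasts it with `csv.DictReader`, which yields one dict per row instead. The `sample` text is an assumption: `a.txt` itself is not shown, so its contents are reconstructed from the outputs above.

```python
import csv
import io

# Assumed contents of a.txt, reconstructed from the outputs shown above.
sample = "a,b,c\n1,2,3\n1,2,3\n"

lines = list(csv.reader(io.StringIO(sample)))
header, values = lines[0], lines[1:]

# zip(*values) turns rows [['1', '2', '3'], ...] into columns [('1', '1'), ...]
columns = list(zip(*values))
data_dict = dict(zip(header, columns))

# csv.DictReader uses the first row as field names and yields a dict per row
rows = list(csv.DictReader(io.StringIO(sample)))

print(data_dict)
print(rows[0])
```

The column-oriented dict suits building a DataFrame column by column, while the row-oriented form is usually easier when records are processed one at a time.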

In [18]:
class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL

In [19]:
with open('mydata.csv', 'w', newline='') as f:  # newline='' leaves line endings to the csv module
    writer = csv.writer(f, dialect=my_dialect)
    writer.writerow(('one', 'two', 'three'))
    writer.writerow(('1', '2', '3'))
    writer.writerow(('4', '5', '6'))

    writer.writerow(('7', '8', '9'))
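
A file written with a custom dialect has to be read back with the same dialect, otherwise the reader falls back to commas and returns each line as a single field. The sketch below shows this on an in-memory buffer rather than on `mydata.csv`; the class name `SemicolonDialect` is illustrative and simply repeats the settings of `my_dialect`.

```python
import csv
import io


class SemicolonDialect(csv.Dialect):
    # Same settings as my_dialect above; the class name is illustrative.
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL


buffer = io.StringIO()
writer = csv.writer(buffer, dialect=SemicolonDialect)
writer.writerow(('one', 'two', 'three'))
writer.writerow(('1', '2', '3'))
written = buffer.getvalue()

# Reading with the matching dialect splits on ';' as intended
parsed = list(csv.reader(io.StringIO(written), dialect=SemicolonDialect))

# Reading with the default dialect splits on ',' and finds no separators
default_parsed = list(csv.reader(io.StringIO(written)))

print(parsed)
print(default_parsed)
```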

In [20]:
import json

In [21]:
obj = """
{"name": "Wes",
 "places_lived": ["United States", "Spain", "Germany"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
              {"name": "Katie", "age": 38,
               "pets": ["Sixes", "Stache", "Cisco"]}]
}
"""

In [22]:
#read JSON
result=json.loads(obj)

In [23]:
result

{'name': 'Wes',
 'places_lived': ['United States', 'Spain', 'Germany'],
 'pet': None,
 'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
  {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}

In [24]:
#convert back to json
asjson = json.dumps(result)

In [25]:
asjson

'{"name": "Wes", "places_lived": ["United States", "Spain", "Germany"], "pet": null, "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]}, {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]}]}'
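
The round trip above can be checked directly: `json.loads` maps JSON `null` to Python `None`, and `json.dumps` maps it back. The following sketch uses a shortened version of the `obj` string and adds the `indent` and `sort_keys` options, which are not used in the cells above, to produce a more readable dump.

```python
import json

# Shortened version of the obj string used above
obj = '{"name": "Wes", "pet": null, "places_lived": ["United States", "Spain"]}'

result = json.loads(obj)  # JSON null becomes Python None

# indent and sort_keys only change the layout, not the data
pretty = json.dumps(result, indent=2, sort_keys=True)

# Parsing the dumped text again gives back an equal object
round_trip = json.loads(pretty)

print(pretty)
```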

In [26]:
import pandas as pd
siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])

In [27]:
siblings

Unnamed: 0,name,age
0,Scott,30
1,Katie,38


**XML and HTML: Web Scraping**

pandas has a built-in function, read_html, which uses libraries like lxml and Beautiful Soup to automatically parse tables out of HTML files as DataFrame objects.

In [28]:
tables=pd.read_html("countries.html")

In [29]:
len(tables)

2

In [30]:
df=pd.DataFrame(tables[0])
df.head()

Unnamed: 0,0,1,2
0,Country orcurrency union,Central bank interest rate (%),Date of last change
1,Albania,1.25,4 May 2016[1]
2,Angola,18.00,30 November 2017[1]
3,Argentina,60.00,5 December 2018[1]
4,Armenia,6.00,14 February 2017[1]


In [31]:
df.columns=df.loc[0]
df.drop([0],inplace=True)

In [32]:
df.head()

Unnamed: 0,Country orcurrency union,Central bank interest rate (%),Date of last change
1,Albania,1.25,4 May 2016[1]
2,Angola,18.0,30 November 2017[1]
3,Argentina,60.0,5 December 2018[1]
4,Armenia,6.0,14 February 2017[1]
5,Australia,1.0,3 July 2019[2]
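
Promoting the first row to the header, as done above, does not by itself guarantee a numeric rate column: the full table also contains a `-` placeholder (the Uruguay row in the complete output). The sketch below is one possible way to handle that, using a small hand-built frame instead of the `read_html` result. It assumes every cell arrives as text, and the column name `Country` is shortened for illustration.

```python
import pandas as pd

# A few rows copied from the notebook outputs, with the header in row 0.
# Assumption: all cells are text, as in a freshly scraped table.
raw = pd.DataFrame([
    ["Country", "Central bank interest rate (%)"],
    ["Albania", "1.25"],
    ["Angola", "18.00"],
    ["Uruguay", "-"],
])

# Use row 0 as the header and drop it from the data
clean = raw.iloc[1:].reset_index(drop=True)
clean.columns = raw.iloc[0].tolist()

# errors="coerce" turns unparseable entries such as "-" into NaN
rate_col = "Central bank interest rate (%)"
clean[rate_col] = pd.to_numeric(clean[rate_col], errors="coerce")

print(clean)
```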


Using **lxml.objectify**, we parse the file and get a reference to the **root node** of the **XML** file with **getroot**:

In [33]:
# <INDICATOR>
#   <INDICATOR_SEQ>373889</INDICATOR_SEQ>
#   <PARENT_SEQ></PARENT_SEQ>
#   <AGENCY_NAME>Metro-North Railroad</AGENCY_NAME>
#   <INDICATOR_NAME>Escalator Availability</INDICATOR_NAME>
#   <DESCRIPTION>Percent of the time that escalators are operational
#   systemwide. The availability rate is based on physical observations performed
#   the morning of regular business days only. This is a new indicator the agency
#   began reporting in 2009.</DESCRIPTION>
#   <PERIOD_YEAR>2011</PERIOD_YEAR>
#   <PERIOD_MONTH>12</PERIOD_MONTH>
#   <CATEGORY>Service Indicators</CATEGORY>
#   <FREQUENCY>M</FREQUENCY>
#   <DESIRED_CHANGE>U</DESIRED_CHANGE>
#   <INDICATOR_UNIT>%</INDICATOR_UNIT>
#   <DECIMAL_PLACES>1</DECIMAL_PLACES>
#   <YTD_TARGET>97.00</YTD_TARGET>
#   <YTD_ACTUAL></YTD_ACTUAL>
#   <MONTHLY_TARGET>97.00</MONTHLY_TARGET>
#   <MONTHLY_ACTUAL></MONTHLY_ACTUAL>
# </INDICATOR>

In [34]:
from lxml import objectify

parsed = objectify.parse("ind.xml")
root = parsed.getroot()

**root.INDICATOR** returns a generator yielding each **INDICATOR** XML element.
For each record, we can populate a **dict** mapping **tag names** (like YTD_ACTUAL) to data values (excluding a few tags):

In [35]:
data = []

skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ',
               'DESIRED_CHANGE', 'DECIMAL_PLACES']

for elt in root.INDICATOR:
    el_data = {}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        el_data[child.tag] = child.pyval
    data.append(el_data)

In [36]:
res=pd.DataFrame(data)
res

Unnamed: 0,AGENCY_NAME,CATEGORY,DESCRIPTION,FREQUENCY,INDICATOR_NAME,INDICATOR_UNIT,MONTHLY_ACTUAL,MONTHLY_TARGET,PERIOD_MONTH,PERIOD_YEAR,YTD_ACTUAL,YTD_TARGET
0,Metro-North Railroad,Service Indicators,Percent of the time that escalators are operat...,M,Escalator Availability,%,,97.0,12,2011,,97.0
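
The same record-to-dict loop can be sketched with the standard library's `xml.etree.ElementTree` in place of `lxml.objectify`. This is a swapped-in technique, not the approach used above: ElementTree has no `pyval`, so every value stays a string and empty tags come back as `None`. The XML below is a shortened copy of the record shown earlier, and the wrapper element name `ROOT` is made up because the real root tag of `ind.xml` is not shown.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the INDICATOR record; <ROOT> is a hypothetical wrapper.
xml_text = """<ROOT>
  <INDICATOR>
    <INDICATOR_SEQ>373889</INDICATOR_SEQ>
    <AGENCY_NAME>Metro-North Railroad</AGENCY_NAME>
    <PERIOD_YEAR>2011</PERIOD_YEAR>
    <YTD_TARGET>97.00</YTD_TARGET>
    <YTD_ACTUAL></YTD_ACTUAL>
  </INDICATOR>
</ROOT>"""

root = ET.fromstring(xml_text)
skip_fields = {"INDICATOR_SEQ"}

data = []
for elt in root.findall("INDICATOR"):
    # Iterating an element yields its direct children
    record = {child.tag: child.text for child in elt
              if child.tag not in skip_fields}
    data.append(record)

print(data)
```

Because the values are plain strings here, numeric columns would still need an explicit conversion after the records are loaded into a DataFrame.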


XML data can get much more complicated than this example. Each tag can have metadata, too. Consider an HTML link tag, which is also valid XML:

In [37]:
from io import StringIO
tag = '<a href="http://www.google.com">Google</a>'
root = objectify.parse(StringIO(tag)).getroot()

In [38]:
root.get("href")

'http://www.google.com'

In [39]:
root.text

'Google'
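
Tag metadata is also available as a plain mapping. As a small sketch, again using the standard library's `xml.etree.ElementTree` rather than `lxml.objectify`, the element's `attrib` dictionary holds every attribute at once. The `target` attribute is added here purely for illustration and is not part of the link above.

```python
import xml.etree.ElementTree as ET

# The target attribute is an illustrative addition
tag = '<a href="http://www.google.com" target="_blank">Google</a>'
link = ET.fromstring(tag)

attributes = dict(link.attrib)      # all attributes as a dict
href = link.get("href")             # a single attribute
missing = link.get("rel", "none")   # default for an absent attribute

print(link.tag, link.text, attributes)
```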

**Binary Data Formats**

One of the easiest ways to **store data** efficiently in binary format (a process known as **serialization**) is Python’s built-in **pickle** module. pandas objects all have a **to_pickle** method that **writes** the data to **disk in pickle format**:

In [40]:
pd.options.display.max_rows=10
df

Unnamed: 0,Country orcurrency union,Central bank interest rate (%),Date of last change
1,Albania,1.25,4 May 2016[1]
2,Angola,18.00,30 November 2017[1]
3,Argentina,60.00,5 December 2018[1]
4,Armenia,6.00,14 February 2017[1]
5,Australia,1.00,3 July 2019[2]
...,...,...,...
89,Uruguay,-,27 June 2013[1]
90,Uzbekistan,16.00,22 September 2018[1]
91,Vietnam,6.25,7 July 2017[1]
92,West African States,2.50,16 September 2013[1]


In [41]:
df.to_pickle("Country_df_pickle")

In [42]:
pd.read_pickle("Country_df_pickle")

Unnamed: 0,Country orcurrency union,Central bank interest rate (%),Date of last change
1,Albania,1.25,4 May 2016[1]
2,Angola,18.00,30 November 2017[1]
3,Argentina,60.00,5 December 2018[1]
4,Armenia,6.00,14 February 2017[1]
5,Australia,1.00,3 July 2019[2]
...,...,...,...
89,Uruguay,-,27 June 2013[1]
90,Uzbekistan,16.00,22 September 2018[1]
91,Vietnam,6.25,7 July 2017[1]
92,West African States,2.50,16 September 2013[1]
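
What happens under the hood can be sketched with the standard library's `pickle` module directly, here on a plain dict instead of a DataFrame. The record's values are taken from the first table row above; the key names are illustrative.

```python
import pickle

# One row from the table above as a plain dict (key names are illustrative)
record = {"country": "Albania", "rate": 1.25, "changed": "4 May 2016"}

# Serialize to bytes, then rebuild an equal but separate object
payload = pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(payload)

print(type(payload).__name__, restored)
```

Unpickling can execute arbitrary code, so pickle files should only be loaded when they come from a trusted source.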


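pandas has built-in support for one more binary data format, ***HDF5***. (Older releases also supported ***MessagePack***, but that support was deprecated in pandas 0.25 and removed in pandas 1.0.)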

**HDF5** is a well-regarded file format intended for storing large quantities of scientific array data. The “HDF” in HDF5 stands for **hierarchical data format**. Each HDF5 file can store **multiple datasets** and supporting metadata.

**Interacting with Web APIs**

Many websites have public APIs providing data feeds via JSON or some other format. There are a number of ways to access these APIs from Python; one easy-to-use method that I recommend is the **requests** package.

To fetch the 30 most recent pandas issues on GitHub, we can make an HTTP GET request using the add-on requests library:

In [43]:
import requests
url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
resp = requests.get(url)
resp

<Response [200]>

In [44]:
data = resp.json()

data[0]['title']

'Max recursion limit on pd.eval'

In [45]:
issues = pd.DataFrame(data, columns=['number', 'title','labels', 'state'])

In [46]:
issues

Unnamed: 0,number,title,labels,state
0,27639,Max recursion limit on pd.eval,[],open
1,27637,Add comment char parameter to to_csv method,[],open
2,27636,Operators between DataFrame and Series fail on...,[],open
3,27634,pandas.ExcelWriter has abstract methods,"[{'id': 49254273, 'node_id': 'MDU6TGFiZWw0OTI1...",open
4,27633,EA: implement+test EA.view,[],open
...,...,...,...,...
25,27607,BUG: break reference cycle in Index._engine,"[{'id': 8935311, 'node_id': 'MDU6TGFiZWw4OTM1M...",open
26,27602,Disable codecov,"[{'id': 48070600, 'node_id': 'MDU6TGFiZWw0ODA3...",open
27,27599,Complex value counts,"[{'id': 172091424, 'node_id': 'MDU6TGFiZWwxNzI...",open
28,27597,BUG: groupby.transform(name) validates name is...,"[{'id': 233160, 'node_id': 'MDU6TGFiZWwyMzMxNj...",open
