# Scrape cases To MySQL DB
Web-scraping is a way to generate dataset from websites. This notebook is a walkthrough for:
1. Scraping from legal cases of kenya from https://www.cases.sheriahub.com/ke/
2. Store scraped data into MySQL DB.

**Tools:** BeautifulSoup, requests

# Get links from the main page

In [14]:
import requests
from bs4 import BeautifulSoup as soup
import datetime
import random

In [15]:
def getPage(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    return soup

In [16]:
# page intrested to script
website = 'https://www.cases.sheriahub.com/ke/'

# html parsing
soup = getPage(website)

# grap all case
case_html = soup.findAll("div",{"class":"posttext pull-left"})
pdfs = soup.findAll("a",{"class":"btn btn-primary pull-right"})

In [28]:
# List of variables
case_titles = []
case_urls = []
case_pdfs = []

for url in range(len(case_html)):
    case_titles.append(case_html[url].a.text.strip())
    case_urls.append('https://www.cases.sheriahub.com/'+case_html[i].a["href"])
    case_pdfs.append('https://www.cases.sheriahub.com/'+pdfs[i]["href"])

In [30]:
#### save to txt
# write caselinks in text file, line by line
with open('case_urls.txt','w') as the_file:
    for case in range(len(case_urls)):  
        the_file.write(case_urls[case]+'\n')

In [29]:
# check scraped urls
case_urls

['https://www.cases.sheriahub.com/case/11f09f49134e61d1a4aa1c44457c1b/',
 'https://www.cases.sheriahub.com/case/11f09f49134e61d1a4aa1c44457c1b/',
 'https://www.cases.sheriahub.com/case/11f09f49134e61d1a4aa1c44457c1b/',
 'https://www.cases.sheriahub.com/case/11f09f49134e61d1a4aa1c44457c1b/',
 'https://www.cases.sheriahub.com/case/11f09f49134e61d1a4aa1c44457c1b/',
 'https://www.cases.sheriahub.com/case/11f09f49134e61d1a4aa1c44457c1b/',
 'https://www.cases.sheriahub.com/case/11f09f49134e61d1a4aa1c44457c1b/',
 'https://www.cases.sheriahub.com/case/11f09f49134e61d1a4aa1c44457c1b/',
 'https://www.cases.sheriahub.com/case/11f09f49134e61d1a4aa1c44457c1b/',
 'https://www.cases.sheriahub.com/case/11f09f49134e61d1a4aa1c44457c1b/']

# Get case info

In [90]:
pause_time = 5 # Websites have a capacity to handle requests, making some delayes would help the website not to crash
for url in range(len(case_urls)):
    caseTableAtt = [] # list for columns of metadata table, clear the list
    caseTableValue = []# list for value of metadata table, clear the list
    metadatas = [] # list of metadata, clear the list
    case_judgement_html_List = []
    case_judgement_text_List = []
    # To get metadata inside the case
    soup = getPage(case_urls[url])
        
    # case_text
    case_judgement_html_List.append(soup.find("div",{"style":"text-align:justify;text-justify:inter-word;border-top:solid px #f1f1f1;padding-top:1em;margin-top:1em"}))
    case_judgement_text_List.append(case_judgement_html.get_text().strip())
    
    # get date of scraping
    date = datetime.datetime.now()
    date = date.strftime("%y-%m-%d %H.%M.%S")
    dateList.append(date)
    
    
    # To extract data from table (e.g. metadata)
    caseTables = soup.findAll("td")
    j=0
    while j<22:# 22 is number of tr of caseTables, fixed becuase some cases also have tr in the text
        caseTableAtt.append(caseTables[j].text)
        j+=1
        caseTableValue.append(caseTables[j].text.replace("\n",""))
        j+=1
        metadatas = dict(zip(caseTableAtt, caseTableValue))
    metadataJsonList.append(json.dumps(metadatas))
    time.sleep(pause_time)

### Check stored data

In [91]:
case_titles[0]

'Miscellaneous Criminal Application E149 of 2021 - Omari Mwinyi Mwapita v Republic'

In [92]:
metadataJsonList[0]

'{"Case Number": "Miscellaneous Criminal Application E149 of 2021", "Parties": "Omari Mwinyi Mwapita v Republic", "Case Class": "Criminal", "Judges": "Grace Lidembu Nzioka ", "Advocates": "Mr. Kagoi for the applicantMs. Akunja for the Respondent.", "Case Action": "Ruling", "Case Outcome": "Application dismissed", "Date Delivered": "30 Dec 2021", "Court County": "Nairobi", "Case Court": "High Court at Nairobi (Milimani Law Courts)", "Court Division": "Criminal"}'

In [93]:
case_judgement_text_List[0][:100]

'REPUBLIC OF KENYAIN THE HIGH COURT OF KENYA AT NAIROBIHIGH COURT MISCELLANEOUS CRIMINAL APPLICATION '

Data has been successfully been scraped!

# Store to DB

The database already been created and has a table with the following rows:
1. id – int - This is autoincrement so you will not insert anything.
2. case_title – varchar
3. metadata – json
4. case_judgement_html – MEDIUMTEXT
5. case_judgement_text– MEDIUMTEXT
6. pdf – LONGBLOB
7. date - datetime


## Connect to mySQL DB
Trying to connect to MySQL DB and save data their

In [8]:
import mysql.connector

db = mysql.connector.connect(host= x ,
                                         database=x,
                                         user=x,
                                         password=x)


In [9]:
mycursor = db.cursor()
mycursor.execute("DESCRIBE ziyad_alshawi")
for x in mycursor:
    print(x)

('id', 'int(11)', 'NO', 'PRI', None, 'auto_increment')
('case_title', 'varchar(225)', 'NO', '', None, '')
('metadata', 'json', 'NO', '', None, '')
('case_judgement_html', 'mediumtext', 'NO', '', None, '')
('case_judgement_text', 'mediumtext', 'NO', '', None, '')
('pdf', 'blob', 'NO', '', None, '')
('date', 'datetime', 'NO', '', None, '')


In [10]:
# insert into the database
for i in range(10):
    query = '''INSERT INTO `ziyad_alshawi` (`case_title`,`metadata`,`case_judgement_html`,`case_judgement_text`,`pdf`,`date`) VALUES ('{}','{}',"{}","{}","{}","{}")'''.format(case_titles[i],metadataJsonList[i] ,case_judgement_html_List[i],case_judgement_text_List[i],pdf[i],dateList[i])
    mycursor.execute(query)
    db.commit()

# Lesson Learned
- Data extraction is a powerful tool to have as a data scientist.
- In DB records/rows need to enter in one query.
- In DB IDs usualy set on auto-increment to clear that need to use `truncate table table_name;` in MySQL Console.
- When extrate from text may encounter with `\n` inside the text which means in MySQL  ASCII 0x10 (newline) doesn't mean `\` + `n` expression. [source](https://stackoverflow.com/questions/47504402/mysql-invalid-json-text-invalid-escape-character-in-string)