# A more complex example on scraping

The previous code did not retrieve the data on wars that appears as a list in [https://en.wikipedia.org/wiki/List_of_wars_by_death_toll](https://en.wikipedia.org/wiki/List_of_wars_by_death_toll#Modern_wars_with_fewer_than_25,000_deaths_by_death_toll). Since this is not a table, it requires more effort:

In [1]:
# previous link:
link='https://en.wikipedia.org/wiki/List_of_wars_by_death_toll'

# call the packages
import requests
from bs4 import BeautifulSoup as soup
  
# the User-Agent header tells the server which kind of client (browser) is asking
HEADERS = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/44.0.2403.157 Safari/537.36'}
  
# creating request object
req = requests.get(link, headers=HEADERS)
  
req.content

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of wars by death toll - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"4122314b-d61a-4a14-be54-ef6808183789","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_wars_by_death_toll","wgTitle":"List of wars by death toll","wgCurRevisionId":1104689488,"wgRevisionId":1104689488,"wgArticleId":15459916,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Harv and Sfn no-target errors","CS1 Chinese-language sources (zh)","CS1 French-language sources (fr)","Web

The previous result is still difficult to read. That's why we need Beautiful Soup.

In [2]:
# creating soup object
dataFull = soup(req.content, features="html.parser")

# you get a more organised content
dataFull

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of wars by death toll - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"4122314b-d61a-4a14-be54-ef6808183789","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_wars_by_death_toll","wgTitle":"List of wars by death toll","wgCurRevisionId":1104689488,"wgRevisionId":1104689488,"wgArticleId":15459916,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Harv and Sfn no-target errors","CS1 Chinese-language sources (zh)","CS1 French-language sources (fr)","Webarchiv

You can now apply **bs** functions to _dataFull_. Let's request all the **ul**s (unordered lists):

In [3]:
# finding all ul tags 
dataULs = dataFull.findAll('ul')
dataULs

[<ul><li><a href="/wiki/Prehistoric_warfare" title="Prehistoric warfare">Prehistoric</a></li>
 <li><a href="/wiki/Ancient_warfare" title="Ancient warfare">Ancient</a></li>
 <li><a href="/wiki/Medieval_warfare" title="Medieval warfare">Post-classical</a></li>
 <li><a href="/wiki/Early_modern_warfare" title="Early modern warfare">Early modern</a></li>
 <li><a href="/wiki/Modern_warfare" title="Modern warfare">Late modern</a>
 <ul><li><a href="/wiki/Industrial_warfare" title="Industrial warfare">industrial</a></li>
 <li><a href="/wiki/Fourth-generation_warfare" title="Fourth-generation warfare">fourth-gen</a></li></ul></li></ul>,
 <ul><li><a href="/wiki/Industrial_warfare" title="Industrial warfare">industrial</a></li>
 <li><a href="/wiki/Fourth-generation_warfare" title="Fourth-generation warfare">fourth-gen</a></li></ul>,
 <ul><li>Aerospace
 <ul><li><a href="/wiki/Aerial_warfare" title="Aerial warfare">Air</a></li>
 <li><a href="/wiki/Airborne_forces" title="Airborne forces">Airborne</a

What do we have?

In [4]:
type(dataULs)

bs4.element.ResultSet

A **ul** contains **li**s (list items), but we have this many **ul**s:

In [5]:
len(dataULs)

78
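Part of that count comes from nesting: in the output of cell [3], the list with 'industrial' and 'fourth-gen' shows up twice, once inside its parent and once on its own. Here is a minimal sketch of that behavior on an invented HTML fragment (not the Wikipedia page); the names `html`, `tiny` and `found` are just for illustration:

```python
from bs4 import BeautifulSoup

# hypothetical fragment: one outer list that holds one nested list
html = """
<ul>
  <li>Late modern
    <ul><li>industrial</li><li>fourth-gen</li></ul>
  </li>
</ul>
"""
tiny = BeautifulSoup(html, features="html.parser")

# find_all searches all descendants, so the nested ul is returned as well
found = tiny.find_all('ul')
print(len(found))  # 2

# recursive=False only looks at direct children, so only the outer ul is found
print(len(tiny.find_all('ul', recursive=False)))  # 1
```

So counting **ul**s does not tell you how many independent lists the page has; it only tells you that you need a more precise way to point at the one you want.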

We have 78 unordered lists, and we need only one of them. The strategy is to find its **XPATH** and recover each war in that **ul** (each **li** element).

In [6]:
from lxml import etree

#turn soup into string, and let it be organized as a tree by lxml:
allInfo = etree.HTML(str(dataFull))

# find the path to the 'ul':

the_X_PATH='//*[@id="mw-content-text"]/div[1]/ul[1]'
pathTo_liS= the_X_PATH + '/li'


# see the result

allInfo.xpath(pathTo_liS)

[<Element li at 0x226555feac0>,
 <Element li at 0x226555fed80>,
 <Element li at 0x226555fec80>,
 <Element li at 0x226555feb80>,
 <Element li at 0x226555fe940>,
 <Element li at 0x226555febc0>,
 <Element li at 0x226555fed40>,
 <Element li at 0x226555fe7c0>,
 <Element li at 0x226555fe9c0>,
 <Element li at 0x226555fe880>,
 <Element li at 0x226555fe980>,
 <Element li at 0x226555fef40>,
 <Element li at 0x226555fef00>,
 <Element li at 0x226555fe900>,
 <Element li at 0x226555fe580>,
 <Element li at 0x226555fea80>,
 <Element li at 0x226555fee00>,
 <Element li at 0x22654e0c080>,
 <Element li at 0x22653753100>,
 <Element li at 0x226556051c0>,
 <Element li at 0x22655605100>,
 <Element li at 0x22655605bc0>,
 <Element li at 0x22655605040>,
 <Element li at 0x22655605d00>,
 <Element li at 0x22655605c40>,
 <Element li at 0x226556055c0>,
 <Element li at 0x226556058c0>,
 <Element li at 0x22655605780>,
 <Element li at 0x22655605e80>,
 <Element li at 0x22655605fc0>,
 <Element li at 0x22655605f80>,
 <Elemen

You now have all the **li** elements from the **ul** element.
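If the path looks cryptic, it may help to read it on a tiny example. The sketch below uses an invented fragment (the id `content` and the list items are assumptions, not the real page) to show that `ul[1]` and `ul[2]` count only the **ul** children of the `div`, ignoring other tags in between:

```python
from lxml import etree

# hypothetical fragment: a div with two lists separated by a paragraph
html = """
<div id="content">
  <div>
    <ul><li>first list, item 1</li><li>first list, item 2</li></ul>
    <p>some text</p>
    <ul><li>second list, item 1</li></ul>
  </div>
</div>
"""
tree = etree.HTML(html)

# '//*[@id="content"]' finds the element with that id anywhere in the page,
# then we walk down: first div child, first (or second) ul child, every li
first = tree.xpath('//*[@id="content"]/div[1]/ul[1]/li')
second = tree.xpath('//*[@id="content"]/div[1]/ul[2]/li')

print([li.text for li in first])   # ['first list, item 1', 'first list, item 2']
print([li.text for li in second])  # ['second list, item 1']
```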

In [7]:
[i.text for i in allInfo.xpath(pathTo_liS)]

['22,000+ – ',
 '22,211 – ',
 '21,000+ – ',
 '20,068 - ',
 '20,000+ – ',
 '20,000+ – ',
 '20,000+ – ',
 '20,000+ – ',
 '19,619+ – ',
 '19,000+ – ',
 '18,069–20,069 – ',
 '17,294+ – ',
 '17,200+ – ',
 '16,765–17,065 – ',
 '16,000+ – ',
 '16,000+ – ',
 '16,000+ – ',
 '15,200–15,300 – ',
 '15,000+ – ',
 '14,460–14,922 – ',
 '14,077–22,077 – ',
 '13,929+ – ',
 '13,812+ – ',
 '13,100–34,000 – ',
 '13,100–13,300 – ',
 '13,073–26,373 – ',
 '11,500–12,843 – ',
 '10,000+ – ',
 '10,000+ – ',
 '10,000+ – ',
 '10,000+ – ',
 '10,000+ – ',
 '10,000+ – ',
 '10,000+ – ',
 '10,000+ – ',
 '9,400+ – ',
 '8,136+ – ',
 '7,500–21,741 – ',
 '7,400–16,200 – ',
 '7,050+ - ',
 '7,104+ – ',
 '7,000+ – ',
 '6,800–13,459 – ',
 '6,859+ – ',
 '5,641–6,991 - ',
 '6,543+ – ',
 '6,295+ – ',
 '5,641+ – ',
 '5,100+ – ',
 '5,000+ – ',
 '5,000+ – ',
 '5,000+ - ',
 '4,715+ – ',
 '4,200+ – ',
 '4,000–10,000 – ',
 '3,699+ – ',
 '3,552+ – ',
 '3,529+ – ',
 '3,366+ – ',
 '3,270+ – ',
 '3,222–3,722 – ',
 '3,144+ – ',
 '3,114+ – 

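The previous result is not what you need: **.text** returns only the text that comes before the first child tag of each **li**. **itertext()** gives the right result, because it goes through every piece of text inside the element: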

In [8]:
all_LIs=[list(i.itertext()) for i in allInfo.xpath(pathTo_liS)] # itertext() returns an iterator over all the text pieces
all_LIs

[['22,000+ – ',
  'Dominican Restoration War',
  ' – One estimate placed total Spanish deaths from all causes at 18,000. The fatal losses among the Dominican insurgents were estimated at 4,000. (1863–1865)',
  '[30]'],
 ['22,211 – ', 'Croatian War of Independence', ' (1991–1995)', '[131]'],
 ['21,000+ – ', 'Six-Day War', ' (1967)', '[132]'],
 ['20,068 - ', 'Reform War', ' (1857–1860)'],
 ['20,000+ – ', 'Yaqui Wars', ' (1533–1929)', '[23]'],
 ['20,000+ – ', 'War of the Quadruple Alliance', ' (1718–1720)', '[30]'],
 ['20,000+ – ', 'Ragamuffin War', ' (1835–1845)', '[133]'],
 ['20,000+ – ', 'Italo-Turkish War', ' (1911–1912)', '[23]'],
 ['19,619+ – ', 'Rhodesian Bush War', ' (1964–1979)'],
 ['19,000+ – ', 'Mexican–American War', ' (1846–1848)', '[23]'],
 ['18,069–20,069 – ', 'First Opium War', ' (1839–1842)', '[134]'],
 ['17,294+ – ', '1940–44 insurgency in Chechnya', ' (1940–1944)'],
 ['17,200+ – ', 'First Anglo-Afghan War', ' (1839–1842)', '[135]'],
 ['16,765–17,065 – ',
  'Balochistan 
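To see why **.text** and **itertext()** differ, here is a small sketch on a single **li**. The markup is an assumption modeled on the output above (a link for the war name and a `sup` for the reference), not a copy of the page source:

```python
from lxml import etree

# hypothetical li, shaped like the Six-Day War entry shown above
snippet = ('<ul><li>21,000+ – <a href="/wiki/Six-Day_War">Six-Day War</a>'
           ' (1967)<sup>[132]</sup></li></ul>')
li = etree.HTML(snippet).xpath('//li')[0]

# .text stops at the first child tag (the link)
print(li.text)  # '21,000+ – '

# itertext() also yields the text inside the children and the text after them
pieces = list(li.itertext())
print(pieces)  # ['21,000+ – ', 'Six-Day War', ' (1967)', '[132]']

# if you prefer one string per war, join the pieces
print(''.join(pieces))
```

Keeping the pieces separate, as we did in **all_LIs**, is what lets each piece become its own column later.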

Notice that we can access each list element:

In [9]:
all_LIs[0]

['22,000+ – ',
 'Dominican Restoration War',
 ' – One estimate placed total Spanish deaths from all causes at 18,000. The fatal losses among the Dominican insurgents were estimated at 4,000. (1863–1865)',
 '[30]']

Each is a row of information. So we can turn each element into a data frame row using Pandas:

In [10]:
import pandas as pd

all_LIs_DF=pd.DataFrame(all_LIs)
all_LIs_DF

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 22,000+ – | Dominican Restoration War | – One estimate placed total Spanish deaths fr... | [30] | | | |
| 1 | 22,211 – | Croatian War of Independence | (1991–1995) | [131] | | | |
| 2 | 21,000+ – | Six-Day War | (1967) | [132] | | | |
| 3 | 20,068 - | Reform War | (1857–1860) | | | | |
| 4 | 20,000+ – | Yaqui Wars | (1533–1929) | [23] | | | |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 172 | 11 – | Great Nordic Biker War | (1994–1997) | | | | |
| 173 | 8 – | 2011 India–Pakistan border skirmish | (2011) | | | | |
| 174 | 7 – | Bellevue War | (1840) | | | | |
| 175 | 2+ – | Ontario Biker War | (1999–2002) | | | | |
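The empty cells appear because the inner lists do not all have the same length: pandas makes as many columns as the longest list needs and fills the shorter rows with missing values. A minimal sketch with two of the rows shown above (the name `toy` is just for illustration):

```python
import pandas as pd

# two rows of different length, taken from the output above
rows = [['21,000+ – ', 'Six-Day War', ' (1967)', '[132]'],
        ['20,068 - ', 'Reform War', ' (1857–1860)']]

toy = pd.DataFrame(rows)

# the longest row has 4 elements, so we get 4 columns
print(toy.shape)  # (2, 4)

# the shorter row is padded with a missing value
print(pd.isna(toy.iloc[1, 3]))  # True
```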



The previous data frames included location. Since this last one does not have it, we could:
1. Get the URL for every war.
2. Visit the URL.
3. Find the **location** in that page.
4. Make a list with the valid locations. 
5. Create a new column with that list.

Getting the links requires changing the XPATH:



In [11]:
wiki='https://en.wikipedia.org' 

links=[wiki+str(i) for i in allInfo.xpath(pathTo_liS + '/a[1]/@href')] # the path now ends in the href of the first link of each li

## see some:
links[15:20]

['https://en.wikipedia.org/wiki/Nepalese_Civil_War',
 'https://en.wikipedia.org/wiki/Spanish%E2%80%93American_War',
 'https://en.wikipedia.org/wiki/Peasants%27_War_(1798)',
 'https://en.wikipedia.org/wiki/Nigerian_Sharia_conflict',
 'https://en.wikipedia.org/wiki/South_African_Border_War']

Notice that the third link shown above may not work well if you click it, but it can still be used in the code:

In [12]:
links[17]

'https://en.wikipedia.org/wiki/Peasants%27_War_(1798)'
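The `%27` in that link is percent-encoding: characters that are not safe in a URL are written as `%` followed by their byte values. If you want to read the link in its plain form, the standard library can decode it. This is only a side sketch; the request itself works with the encoded link:

```python
from urllib.parse import unquote

encoded = 'https://en.wikipedia.org/wiki/Peasants%27_War_(1798)'

# %27 is the apostrophe
print(unquote(encoded))  # https://en.wikipedia.org/wiki/Peasants'_War_(1798)

# non-ASCII characters take several bytes: %E2%80%93 is the en dash
print(unquote('https://en.wikipedia.org/wiki/Spanish%E2%80%93American_War'))
```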

Let me get the location for this link:

In [13]:

# creating request object
req = requests.get(links[17], headers=HEADERS) #here is the link!!

# creating soup object
dataFull = soup(req.content, features="html.parser")

# you get the location
dataFull.findAll('div', {'class': 'location'})[0].text


'Southern Netherlands annexed by the French Republic[b]'
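Not every page has a `div` with class `location`, and in that case indexing with `[0]` raises an error. One possible way to handle both cases is a small helper built on `find`, which returns `None` when nothing matches. The function name `get_location` and the two HTML strings are assumptions made up for this sketch:

```python
from bs4 import BeautifulSoup

def get_location(html):
    """Return the text of the first div with class 'location', or None."""
    page = BeautifulSoup(html, features="html.parser")
    found = page.find('div', {'class': 'location'})
    if found is None:
        return None
    return found.get_text(strip=True)

# hypothetical pages: one with a location div, one without
with_loc = '<table><tr><td><div class="location">Southern Netherlands</div></td></tr></table>'
without_loc = '<p>no infobox here</p>'

print(get_location(with_loc))     # Southern Netherlands
print(get_location(without_loc))  # None
```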

I will use every link now:

In [None]:
import requests
from bs4 import BeautifulSoup as soup

locations=[]

for URL in links:
  # creating request object
  req = requests.get(URL, headers=HEADERS)

  # creating soup object with all the html from the link
  dataFull = soup(req.content, features="html.parser")

  # keep the first 'location' div, if the page has one
  try:
    location=dataFull.findAll('div', {'class': 'location'})[0]
    locations.append(location.text)
  except IndexError:
    locations.append(None) # None if no location is available
  

In [None]:
# you got:
len(locations)

Let's just add those links and locations as additional columns to our previous data frame:

In [None]:
all_LIs_DF['url']=links
all_LIs_DF['Location']=locations
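This assignment only works when each list has exactly one element per row of the data frame. That holds if every **li** has a link; if some **li** had none, a single XPATH over all the items would simply skip it and the list would come out shorter. The sketch below, on an invented list, shows the difference and one possible way to keep the alignment by asking each **li** separately:

```python
from lxml import etree

# hypothetical list where the second item has no link
html = ('<ul><li>10 – <a href="/wiki/A">A</a></li>'
        '<li>5 – no link here</li>'
        '<li>2 – <a href="/wiki/C">C</a></li></ul>')
tree = etree.HTML(html)
items = tree.xpath('//ul/li')

# one path for everything: items without a link are silently skipped
flat = tree.xpath('//ul/li/a[1]/@href')
print(len(items), len(flat))  # 3 2

# one query per li: a missing link becomes None, so positions still match
per_li = []
for li in items:
    hrefs = li.xpath('./a[1]/@href')
    per_li.append(hrefs[0] if hrefs else None)
print(per_li)  # ['/wiki/A', None, '/wiki/C']
```

Comparing `len(links)` with `len(all_LIs_DF)` before the assignment is a cheap way to check which situation you are in.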

In [None]:
all_LIs_DF.head()

In [None]:
all_LIs_DF.columns

In [None]:
newNames=['Deathrange','War','Date','Notes1','Notes2','Notes3', 'Notes4','url','Location']
all_LIs_DF.columns=newNames

In [None]:
all_LIs_DF

In [None]:
ToAGG=['Notes'+ str(i) for i in range(1,5)]
ToAGG

In [None]:
all_LIs_DF[ToAGG].fillna('').sum(axis=1)

In [None]:
#then
all_LIs_DF['Notes']=all_LIs_DF[ToAGG].fillna('').sum(axis=1)

#so we can drop all the original 'notes'
all_LIs_DF.drop(ToAGG,axis=1,inplace=True)
all_LIs_DF
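The row-wise `sum` works here because adding strings means concatenating them. If you prefer to make the concatenation explicit, joining the values of each row is an alternative that should give the same column. A sketch on toy data (the data frame `toy` and its values are invented for illustration):

```python
import pandas as pd

# toy version of our data frame, with two 'notes' columns
toy = pd.DataFrame({'War': ['Yaqui Wars', 'Reform War'],
                    'Notes1': ['[23]', None],
                    'Notes2': ['[30]', None]})
note_cols = ['Notes1', 'Notes2']

# replace missing values with empty strings, then join each row's pieces
toy['Notes'] = toy[note_cols].fillna('').agg(''.join, axis=1)

# the original 'notes' columns are no longer needed
toy = toy.drop(columns=note_cols)
print(toy)
```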