# Getting data into Python

The data in this example will come from this location:

* [https://en.wikipedia.org/wiki/List_of_wars_by_death_toll](https://en.wikipedia.org/wiki/List_of_wars_by_death_toll)

In [1]:
# here we save the complete link (in quotation marks):
link='https://en.wikipedia.org/wiki/List_of_wars_by_death_toll'

Getting data from the web requires more than Python's built-in functions, so you need to install some extra packages:
* pandas
* html5lib
* beautifulsoup4
* lxml
* requests

In [2]:
# this command shows whether the packages listed above are already installed:

!pip show pandas html5lib beautifulsoup4 lxml requests 

Name: pandas
Version: 1.4.3
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author: The Pandas Development Team
Author-email: pandas-dev@python.org
License: BSD-3-Clause
Location: c:\users\victorgabriel\anaconda3\envs\fgv\lib\site-packages
Requires: numpy, python-dateutil, pytz
Required-by: geopandas, mapclassify
---
Name: html5lib
Version: 1.1
Summary: HTML parser based on the WHATWG HTML specification
Home-page: https://github.com/html5lib/html5lib-python
Author: 
Author-email: 
License: MIT License
Location: c:\users\victorgabriel\anaconda3\envs\fgv\lib\site-packages
Requires: six, webencodings
Required-by: 
---
Name: beautifulsoup4
Version: 4.11.1
Summary: Screen-scraping library
Home-page: https://www.crummy.com/software/BeautifulSoup/bs4/
Author: Leonard Richardson
Author-email: leonardr@segfault.org
License: MIT
Location: c:\users\victorgabriel\anaconda3\envs\fgv\lib\site-packages
Requires: soupsieve
Required-

In [3]:
!python --version

Python 3.9.12
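
The same check can be done from inside Python. The following is only a sketch: the helper name `missing_packages` is made up for this example, and it assumes the usual import names (for instance, the pip package `beautifulsoup4` is imported as `bs4`).

```python
from importlib.util import find_spec

# pip name -> import name (the two differ for beautifulsoup4)
required = {
    "pandas": "pandas",
    "html5lib": "html5lib",
    "beautifulsoup4": "bs4",
    "lxml": "lxml",
    "requests": "requests",
}


def missing_packages(packages):
    """Return the pip names of the packages that cannot be imported."""
    return [pip_name
            for pip_name, import_name in packages.items()
            if find_spec(import_name) is None]


missing = missing_packages(required)
print("Missing packages:", missing or "none")
```

Every name printed as missing is a candidate for `pip install`.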


Any missing package can be installed using **pip install _package-name_**. Let's read the tables with pandas' *read_html*:

In [6]:
import pandas as pd


wars = pd.read_html(io=link,  # read the tables from the link
                    flavor='bs4',  # the parsing engine (BeautifulSoup)
                    attrs={'class': 'sortable wikitable'})  # how did I know this? (inspect the page's HTML)
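
One way to find a value for `attrs` is to look at the `class` attribute of each `<table>` tag in the page's HTML. The sketch below does this with the standard library only, on a small made-up HTML snippet rather than the real Wikipedia page. The class name `TableClassCollector` and the sample HTML are assumptions made for this illustration.

```python
from html.parser import HTMLParser


class TableClassCollector(HTMLParser):
    """Collect the class attribute of every <table> tag it sees."""

    def __init__(self):
        super().__init__()
        self.table_classes = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # attrs is a list of (name, value) pairs
            self.table_classes.append(dict(attrs).get("class"))


# Hypothetical snippet standing in for the downloaded page
sample_html = """
<table class="sortable wikitable"><tr><th>War</th></tr><tr><td>Punic Wars</td></tr></table>
<table class="navbox"><tr><td>Navigation</td></tr></table>
"""

collector = TableClassCollector()
collector.feed(sample_html)
print(collector.table_classes)  # ['sortable wikitable', 'navbox']
```

The browser's "Inspect" or "View page source" option shows the same information for the real page.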

The object **wars** is not a table, but a list (of sortable wiki tables):

In [7]:
# type of object
type(wars)

list

...and we have this many:

In [8]:
len(wars)

3

You can access each element of the list using an index inside '[ ]'. This is what you have in the first element:

In [9]:
type(wars[0])  # indexing starts at zero in Python

pandas.core.frame.DataFrame
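
List indexing works the same way for any list of data frames. This is a small sketch using a stand-in list named `tables` with made-up values, not the scraped data:

```python
import pandas as pd

# Stand-in for the scraped list: three tiny data frames (made-up values)
tables = [
    pd.DataFrame({"War": ["A", "B"]}),
    pd.DataFrame({"War": ["C"]}),
    pd.DataFrame({"War": ["D", "E", "F"]}),
]

first = tables[0]   # first element
last = tables[-1]   # negative indexes count from the end

print(type(first).__name__)   # DataFrame
print(len(first), len(last))  # 2 3
```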

The first element of _wars_ is a pandas dataframe. Let's check some info:

In [10]:
wars[0].shape  # number of rows and columns

(24, 6)

In [11]:
wars[0].columns # column names

Index(['War', 'Deathrange', 'Date', 'Combatants', 'Location', 'Notes'], dtype='object')

In [12]:
wars[0].info() # column data types

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   War         24 non-null     object
 1   Deathrange  24 non-null     object
 2   Date        24 non-null     object
 3   Combatants  24 non-null     object
 4   Location    24 non-null     object
 5   Notes       20 non-null     object
dtypes: object(6)
memory usage: 1.2+ KB
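
The non-null counts above show that a column can have missing values. A quick way to count them per column is `isna().sum()`. This is sketched here on a small made-up data frame called `demo`, not on the real table:

```python
import pandas as pd

# Made-up data with one missing note
demo = pd.DataFrame({
    "War": ["A", "B", "C"],
    "Notes": ["some text", None, "more text"],
})

# isna() marks missing cells as True; sum() counts them per column
missing_per_column = demo.isna().sum()
print(missing_per_column)
```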


In [13]:
wars[0].head() # or tail()

| | War | Deathrange | Date | Combatants | Location | Notes |
|---|---|---|---|---|---|---|
| 0 | Conquests of Cyrus the Great | 100,000+ | 549 BC–530 BC | Persian Empire vs. various states | Middle East | Number given is the sum of all deaths in battl... |
| 1 | Greco–Persian Wars | 300,000+ | 499 BC–449 BC | Greek City-States vs. Persian Empire | Greece | |
| 2 | Samnite Wars | 33,500+ | 343 BC–290 BC | Roman Republic vs. Samnites | Italy | Number given is the sum of all deaths in battl... |
| 3 | Wars of Alexander the Great | 142,000+ | 336 BC–323 BC | Macedonian Empire and other Greek City-States ... | Middle East / North Africa / Central Asia / India | Number given is the sum of all deaths in battl... |
| 4 | Punic Wars | 1,250,000–1,850,000 | 264 BC–146 BC | Roman Republic vs. Carthaginian Empire | Western Europe / North Africa | |


Now find the following information:
* The number of rows in the other data frames in the object _wars_.
* Confirm that all those data frames share the same column names.
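
One possible approach to both tasks is sketched below on a stand-in list named `tables` with made-up values. The same loop and comparison should work on _wars_ itself, assuming it was loaded as shown earlier.

```python
import pandas as pd

# Stand-in for the scraped list (made-up values)
tables = [
    pd.DataFrame({"War": ["A", "B"], "Date": ["1", "2"]}),
    pd.DataFrame({"War": ["C"], "Date": ["3"]}),
    pd.DataFrame({"War": ["D", "E", "F"], "Date": ["4", "5", "6"]}),
]

# number of rows in each data frame
row_counts = [len(table) for table in tables]
print(row_counts)  # [2, 1, 3]

# compare every data frame's column names against the first one
reference = list(tables[0].columns)
same_columns = all(list(table.columns) == reference for table in tables)
print(same_columns)  # True
```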

The same Wikipedia page has more information on wars that is more difficult to scrape. An example is available [HERE](https://colab.research.google.com/drive/1fMva9mhuUiLOlxQ6w7irzz2QVeKTzOXf?usp=sharing) in case you want to know more.

In [14]:
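import os

# create the output folder if it does not exist yet (to_csv will not create it)
os.makedirs("DataFiles", exist_ok=True)

# save each data frame as a CSV file, without the row index
wars[0].to_csv(os.path.join("DataFiles","wars1P.csv"),index=False)
wars[1].to_csv(os.path.join("DataFiles","wars2P.csv"),index=False)
wars[2].to_csv(os.path.join("DataFiles","wars3P.csv"),index=False)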
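
To confirm that a file was written correctly, it can be read back with `pd.read_csv`. The sketch below uses a temporary folder and a one-row data frame so that it runs on its own. The file name `wars_demo.csv` and the variable names are made up for this example.

```python
import os
import tempfile

import pandas as pd

# One-row data frame with values taken from the table shown earlier
original = pd.DataFrame({
    "War": ["Punic Wars"],
    "Deathrange": ["1,250,000–1,850,000"],
})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "wars_demo.csv")
    original.to_csv(path, index=False)   # write without the row index
    restored = pd.read_csv(path)         # read the file back

print(restored)
```

Because the file was saved with `index=False`, the restored data frame has no extra `Unnamed: 0` column.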