# Working with data
## Extracting rows and columns

***

## Selecting columns from a pandas DataFrame

* If we know which columns we need before reading the file, we can tell `read_csv()` to import only those columns by passing a list of their index numbers (starting at 0) to the `usecols` parameter.
* Alternatively, we can pass a list of column names to `usecols`.

In [1]:
import pandas as pd

df = pd.read_csv("data/FIC.csv", usecols=[0, 2, 4])
df

Unnamed: 0,Age,Gender,Marital status
0,45,Female,MARRIED
1,51,Female,MARRIED
2,55,Female,MARRIED
3,55,Female,MARRIED
4,56,Female,MARRIED
...,...,...,...
363,55,Male,MARRIED
364,55,Male,MARRIED
365,58,Male,MARRIED
366,58,Male,MARRIED


In [2]:
df = pd.read_csv("data/FIC.csv", usecols=["Age", "Gender", "Depression"])
df

Unnamed: 0,Age,Gender,Depression
0,45,Female,YES
1,51,Female,YES
2,55,Female,YES
3,55,Female,YES
4,56,Female,YES
...,...,...,...
363,55,Male,YES
364,55,Male,YES
365,58,Male,YES
366,58,Male,YES
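
The two forms of `usecols` can be compared without the data file. The sketch below reads a small in-memory CSV instead; the text and its column names are made up for illustration and only loosely mirror `FIC.csv`.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file on disk (not the real FIC.csv).
csv_text = """Age,Age.Group,Gender,Locality
45,41-50,Female,RURAL
51,51-60,Female,URBAN
55,51-60,Male,RURAL
"""

# Select the first and third columns by position ...
by_position = pd.read_csv(io.StringIO(csv_text), usecols=[0, 2])
# ... and the same columns by name.
by_name = pd.read_csv(io.StringIO(csv_text), usecols=["Age", "Gender"])

print(by_position.columns.tolist())
print(by_position.equals(by_name))
```

Both calls should give the same two-column DataFrame, so the choice between positions and names is mostly a matter of readability.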


* Column selection is also possible after all data has been loaded from the file.

In [3]:
df = pd.read_csv("data/FIC.csv")
df = df[["Age", "Gender", "Depression"]]
df

Unnamed: 0,Age,Gender,Depression
0,45,Female,YES
1,51,Female,YES
2,55,Female,YES
3,55,Female,YES
4,56,Female,YES
...,...,...,...
363,55,Male,YES
364,55,Male,YES
365,58,Male,YES
366,58,Male,YES


## Filtering by rows

* You can select a range of rows with a slice of the form `a:b`.
* `a` is the position of the first row (counting from 0) and `b` is one past the last row required.

In [4]:
df = pd.read_csv("data/FIC.csv")
df = df[5:10]
df

Unnamed: 0,Age,Age.Group,Gender,Locality,Marital status,Life.Style,Sleep,Category,Depression,Hyperlipi,...,oldpeak,slope,ca,thal,num,SK,SK.React,Reaction,Mortality,Follow.Up
5,56,51-60,Female,URBAN,MARRIED,NO,NO,FREE,YES,YES,...,1.9,2,2,7,2,1,NO,0,1,32
6,57,51-60,Female,RURAL,MARRIED,YES,YES,PAID,YES,YES,...,0.2,2,0,7,1,1,NO,0,0,60
7,57,51-60,Female,RURAL,MARRIED,NO,NO,FREE,YES,YES,...,0.0,2,1,3,1,1,NO,0,1,3
8,58,51-60,Female,URBAN,MARRIED,NO,NO,FREE,YES,YES,...,0.0,1,2,3,3,1,NO,0,0,15
9,58,51-60,Female,RURAL,MARRIED,YES,YES,FREE,YES,YES,...,2.8,2,2,6,2,1,NO,0,0,6
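
With the default index, row positions and index labels are the same numbers, so it is easy to miss that an integer slice counts positions. A minimal sketch on toy data (the frame and its string labels are hypothetical) makes the difference visible:

```python
import pandas as pd

# Toy data; the index labels are letters so they cannot be mistaken for positions.
df_demo = pd.DataFrame({"Age": [45, 51, 55, 56, 57]}, index=list("abcde"))

# Integer slice: rows at positions 1 and 2 (position 3 is excluded).
subset = df_demo[1:3]
print(subset)
```

The result holds the rows labelled `b` and `c`, that is, the second and third rows regardless of what the index contains.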


## Other methods of accessing rows and columns

* `iloc` - selects rows and columns by integer position
* `loc` - selects rows and columns by label (index value and column name)

In [5]:
df = pd.read_csv("data/FIC.csv")

df.iloc[0:3,0:3]

Unnamed: 0,Age,Age.Group,Gender
0,45,41-50,Female
1,51,51-60,Female
2,55,51-60,Female


In [6]:
df.loc[[1,4],["Age", "Gender", "Depression"]]

Unnamed: 0,Age,Gender,Depression
1,51,Female,YES
4,56,Female,YES
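
One difference between the two accessors is easy to overlook: a slice in `iloc` excludes its end position, while a label slice in `loc` includes its end label. The sketch below uses a made-up frame with string index labels to show this.

```python
import pandas as pd

# Toy data with hypothetical patient labels as the index.
df_demo = pd.DataFrame(
    {"Age": [45, 51, 55, 56], "Gender": ["Female", "Female", "Male", "Male"]},
    index=["p1", "p2", "p3", "p4"],
)

# Position-based: rows 0 and 1, column 0 only (end positions excluded).
by_position = df_demo.iloc[0:2, 0:1]

# Label-based: rows "p1" through "p2" and columns "Age" through "Gender"
# (end labels included).
by_label = df_demo.loc["p1":"p2", "Age":"Gender"]

print(by_position.shape)
print(by_label.shape)
```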


## Basic operations on a DataFrame

* Creation of a copy of a DataFrame object
* Operations on columns
* Creating new columns
* Renaming columns

In [7]:
df = pd.read_csv("data/FIC.csv")
df_copy = df[["Age", "Gender", "Depression"]].copy()
df_copy

Unnamed: 0,Age,Gender,Depression
0,45,Female,YES
1,51,Female,YES
2,55,Female,YES
3,55,Female,YES
4,56,Female,YES
...,...,...,...
363,55,Male,YES
364,55,Male,YES
365,58,Male,YES
366,58,Male,YES
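
The point of `copy()` is that the new object is independent of the original. A small sketch on hypothetical toy data, assuming we want to change the copy while keeping the source frame intact:

```python
import pandas as pd

# Toy data standing in for the full DataFrame.
df_demo = pd.DataFrame({"Age": [45, 51], "Gender": ["Female", "Male"]})

# Take an explicit copy of one column and modify only the copy.
df_copy = df_demo[["Age"]].copy()
df_copy["Age"] = df_copy["Age"] + 1

print(df_demo["Age"].tolist())  # original values are unchanged
print(df_copy["Age"].tolist())
```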


* A very convenient feature of pandas is the ability to perform an operation on an entire column at once.

In [8]:
df_countries = pd.read_csv("data/countries.csv", usecols=["Country","Population"])
print(df_countries[:5])
print()
df_countries['Population'] /= 1000000  # population in millions
print(df_countries[:5])

           Country  Population
0     Afghanistan     31056997
1         Albania      3581655
2         Algeria     32930091
3  American Samoa        57794
4         Andorra        71201

           Country  Population
0     Afghanistan    31.056997
1         Albania     3.581655
2         Algeria    32.930091
3  American Samoa     0.057794
4         Andorra     0.071201


In [9]:
df_countries["New column"] = 1
df_countries[:5]

Unnamed: 0,Country,Population,New column
0,Afghanistan,31.056997,1
1,Albania,3.581655,1
2,Algeria,32.930091,1
3,American Samoa,0.057794,1
4,Andorra,0.071201,1
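
A new column does not have to be a constant; it can also be computed from existing columns, element by element. The following is a sketch with invented numbers, and the `Area` and `Density` column names are assumptions made for this example only.

```python
import pandas as pd

# Toy data; the values are invented for illustration.
df_demo = pd.DataFrame({
    "Country": ["A", "B"],
    "Population": [2_000_000, 500_000],
    "Area": [1000, 250],
})

# Row-wise division of one column by another, stored as a new column.
df_demo["Density"] = df_demo["Population"] / df_demo["Area"]
print(df_demo)
```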


In [10]:
df_countries = df_countries.rename(columns={"New column":"Other name", "Population":"Population in millions"})
df_countries

Unnamed: 0,Country,Population in millions,Other name
0,Afghanistan,31.056997,1
1,Albania,3.581655,1
2,Algeria,32.930091,1
3,American Samoa,0.057794,1
4,Andorra,0.071201,1
...,...,...,...
222,West Bank,2.460492,1
223,Western Sahara,0.273008,1
224,Yemen,21.456188,1
225,Zambia,11.502010,1


## Iterating over rows

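* `DataFrame.iterrows()` - yields consecutive rows as `(index, Series)` pairs
* `DataFrame.itertuples()` - yields consecutive rows as named tuples; column names that are not valid Python identifiers are replaced with positional names such as `_2`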

In [11]:
for index, row in df_countries.iterrows():
    if row['Population in millions'] > 100:
        df_countries.loc[index, 'Size'] = 'Big'
        print(row['Country'], row['Population in millions'])

Bangladesh  147.365352
Brazil  188.078227
China  1313.973713
India  1095.351995
Indonesia  245.452739
Japan  127.463611
Mexico  107.449525
Nigeria  131.859731
Pakistan  165.80356
Russia  142.89354
United States  298.444215


In [12]:
for row in df_countries.itertuples():
    if row[2] > 200:
        print(row)

Pandas(Index=42, Country='China ', _2=1313.973713, _3=1, Size='Big')
Pandas(Index=94, Country='India ', _2=1095.351995, _3=1, Size='Big')
Pandas(Index=95, Country='Indonesia ', _2=245.452739, _3=1, Size='Big')
Pandas(Index=214, Country='United States ', _2=298.444215, _3=1, Size='Big')
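
Explicit loops over rows are often unnecessary. As a hedged alternative to the `iterrows()` loop above, the same kind of labelling can usually be written with a Boolean condition and `loc`. The sketch uses invented toy data, not the countries file.

```python
import pandas as pd

# Toy data; names and numbers are invented.
df_demo = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Population in millions": [147.4, 3.6, 1314.0],
})

# Condition evaluated for the whole column at once.
is_big = df_demo["Population in millions"] > 100

# Assign only to the matching rows; other rows get a missing value in "Size".
df_demo.loc[is_big, "Size"] = "Big"

print(df_demo.loc[is_big, ["Country", "Population in millions"]])
```

On large tables this form tends to be much faster than iterating, because the comparison runs on the whole column rather than row by row.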


## Selection of rows

* You can select rows using a list of Boolean values whose length equals the number of rows. Rows whose corresponding value is `True` are selected.

In [13]:
df_countries = pd.read_csv("data/countries.csv")
df_countries = df_countries[10:20]
df_countries[[True, True, False, True, False, False, False, False, False, False]]

Unnamed: 0,Country,Region,Population,Area sq. mi.,Pop. Density sq. mi.,Coastline coast/area ratio,Net migration,Infant mortality per 1000 births,GDP,Literacy,Phones per 1000,Climate,Agriculture,Industry,Service
10,Aruba,LATIN AMER. & CARIB,71891,193,3725,3549,0,589,28000.0,970,5161,2,4,333,663
11,Australia,OCEANIA,20264082,7686850,26,34,398,469,29000.0,1000,5655,1,38,262,7
13,Azerbaijan,C.W. OF IND. STATES,7961619,86600,919,0,-49,8174,3400.0,970,1371,1,141,457,402


* Such a Boolean list can be created by writing a condition on a column.

In [15]:
df_countries = pd.read_csv("data/countries.csv")
df_countries['Population'] /= 1000000

# select countries with population greater than 150 million
df_countries[df_countries['Population']>150]

Unnamed: 0,Country,Region,Population,Area sq. mi.,Pop. Density sq. mi.,Coastline coast/area ratio,Net migration,Infant mortality per 1000 births,GDP,Literacy,Phones per 1000,Climate,Agriculture,Industry,Service
27,Brazil,LATIN AMER. & CARIB,188.078227,8511965,221,9,-3,2961,7600.0,864,2253,2,84,4,516
42,China,ASIA (EX. NEAR EAST),1313.973713,9596960,1369,15,-4,2418,5000.0,909,2667,15,125,473,403
94,India,ASIA (EX. NEAR EAST),1095.351995,3287590,3332,21,-7,5629,2900.0,595,454,25,186,276,538
95,Indonesia,ASIA (EX. NEAR EAST),245.452739,1919440,1279,285,0,356,3200.0,879,520,2,134,458,408
156,Pakistan,ASIA (EX. NEAR EAST),165.80356,803940,2062,13,-277,7244,2100.0,457,318,1,216,251,533
214,United States,NORTHERN AMERICA,298.444215,9631420,310,21,341,65,37800.0,970,8980,3,1,204,787


In [16]:
# select countries with population greater than 100 million and from outside Asia
df_countries[(df_countries.Population>100) & (~df_countries.Region.str.contains('ASIA'))]

Unnamed: 0,Country,Region,Population,Area sq. mi.,Pop. Density sq. mi.,Coastline coast/area ratio,Net migration,Infant mortality per 1000 births,GDP,Literacy,Phones per 1000,Climate,Agriculture,Industry,Service
27,Brazil,LATIN AMER. & CARIB,188.078227,8511965,221,9,-3,2961,7600.0,864,2253,2.0,84,4,516
135,Mexico,LATIN AMER. & CARIB,107.449525,1972550,545,47,-487,2091,9000.0,922,1816,15.0,38,259,702
152,Nigeria,SUB-SAHARAN AFRICA,131.859731,923768,1427,9,26,988,900.0,680,93,15.0,269,487,244
169,Russia,C.W. OF IND. STATES,142.89354,17075200,84,22,102,1539,8900.0,996,2806,,54,371,575
214,United States,NORTHERN AMERICA,298.444215,9631420,310,21,341,65,37800.0,970,8980,3.0,1,204,787
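
It may help to look at the condition on its own: a comparison on a column gives a Boolean Series, and such Series are combined with `&` (and), `|` (or) and `~` (not). Each comparison needs its own parentheses because these operators bind more tightly than `>` or `==`. A minimal sketch on invented data:

```python
import pandas as pd

# Toy data; countries, regions and populations are invented.
df_demo = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Region": ["ASIA", "EUROPE", "ASIA", "OCEANIA"],
    "Population": [120.0, 80.0, 30.0, 150.0],
})

# The condition alone is a Boolean Series with one value per row.
mask = df_demo["Population"] > 100
print(mask.tolist())

# "or": large population or located in Europe.
either = df_demo[(df_demo["Population"] > 100) | (df_demo["Region"] == "EUROPE")]

# "not": everything outside Asia.
not_asia = df_demo[~(df_demo["Region"] == "ASIA")]

print(either["Country"].tolist())
print(not_asia["Country"].tolist())
```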


## --- Exercise ---

Calculate how many countries in the `data/countries.csv` file have a GDP between 10000 and 20000 and a population of more than 5 million.

In [None]:
# Write your code here