# AICE1006 - Data Analytics

## Lecture 4 - Data Processing


**Zhiwu Huang**  <br/>
Lecturer (Assistant Professor) <br/>
Vision, Learning and Control (VLC) Research Group <br/>
School of Electronics and Computer Science (ECS) <br/>
University of Southampton<br/>

*Office Hour: Wed 2PM-3PM, Please book in advance.* <br/>
``Zhiwu.Huang@soton.ac.uk``

<br/>
<br/>
<!-- <br/> -->

Credit: Marco Forgione, Researcher, USI-SUPSI



A helper class for this lecture:

In [1]:
class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New",Courier,monospace'>{0}</p>{1}
    </div>"""
    def __init__(self,*args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a,eval(a)._repr_html_())
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)

Taken from https://github.com/jakevdp/PythonDataScienceHandbook. It is useful to display several pandas dataframes side by side.

### Pandas recap

The pandas module provides: 

* a 2D labeled data structure: ``pd.DataFrame``
* a 1D labeled data structure: ``pd.Series``

We typically read a dataframe from an external source (e.g. CSV file) and perform manipulations on it. <!-- ## Here,let us work on artificial data. -->

In [2]:
# usual imports...
import pandas as pd
import numpy as np

In [3]:
# create artificial data (normally we read from an external source...)
data = pd.DataFrame(np.random.randn(3, 2), columns=["A", "B"], index=["a", "b", "c"]) #  = pd.read_csv("filename.csv")
data # this is a pandas DataFrame

Unnamed: 0,A,B
a,1.166576,-0.478198
b,-0.281521,0.980107
c,-0.541647,0.35251


In [4]:
data["B"] # this is a Series

a   -0.478198
b    0.980107
c    0.352510
Name: B, dtype: float64

In [5]:
data.loc["c"] # this is also a Series...

A   -0.541647
B    0.352510
Name: c, dtype: float64

We have seen how to select rows/columns and filter according to logical conditions (find city in Switzerland...) in Lecture 3. <br/>
Today, we do some basic processing!

### Unary Operations

**Unary**  operations from ``numpy`` can be applied **element-wise** to all the elements of a series or dataframe. 

In [6]:
data # a dataframe with random data, from the previous slide

Unnamed: 0,A,B
a,1.166576,-0.478198
b,-0.281521,0.980107
c,-0.541647,0.35251


In [7]:
np.sin(data["A"]) # np.sin function applied to the series data["A"]

a    0.919409
b   -0.277817
c   -0.515548
Name: A, dtype: float64

In [8]:
data_sin = np.sin(data) # np.sin also works on the dataframe...
data_exp = np.exp(data) # other numpy functions (np.cos, np.log, ...) work in the same way
data_half_sq = data**2/2.0 # broadcasting of scalar value 2.0 to all elements of the dataframe also works as expected
display('data','data_sin','data_exp','data_half_sq') # the display class that we defined in slide 2...

Unnamed: 0,A,B
a,1.166576,-0.478198
b,-0.281521,0.980107
c,-0.541647,0.35251

Unnamed: 0,A,B
a,0.919409,-0.46018
b,-0.277817,0.830557
c,-0.515548,0.345254

Unnamed: 0,A,B
a,3.210978,0.6199
b,0.754635,2.664741
c,0.581789,1.422633

Unnamed: 0,A,B
a,0.680449,0.114336
b,0.039627,0.480305
c,0.146691,0.062131


### Unary Operations

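Generic unary operations may be applied element-wise via the **map** or **apply** methods (for ``pd.Series``) and the **applymap** method (for ``pd.DataFrame``; renamed to ``DataFrame.map`` in pandas 2.1, where ``applymap`` is deprecated):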

In [9]:
data = pd.DataFrame(np.random.randint(0, 10, size=(3, 2)), columns = ["A", "B"])
data

Unnamed: 0,A,B
0,5,2
1,0,7
2,4,3


In [10]:
# The element-wise operation that we want to apply to our data
def squared_plus1(x): 
    return x**2 + 1

In [11]:
data["A"].apply(squared_plus1) # apply squared_plus1 to all elements of the series. Note: data["A"] is a pd.Series!
# equivalent to data["A"]**2 + 1

0    26
1     1
2    17
Name: A, dtype: int64

In [12]:
data_1 = data.applymap(squared_plus1) # apply squared_plus1 to all elements of the df
display('data', 'data_1')



Unnamed: 0,A,B
0,5,2
1,0,7
2,4,3

Unnamed: 0,A,B
0,26,5
1,1,50
2,17,10


Anonymous *lambda* functions may also be used:

In [13]:
data["A"].apply(lambda x: x**2 + 1); # same as data["A"].apply(squared_plus1)
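
The cells above use ``apply`` on the series; ``map`` gives the same result for an element-wise function and additionally accepts a dictionary as a lookup table. A minimal sketch on small hand-written data (the names ``s``, ``labels`` and ``elementwise`` are illustrative, not part of the lecture):

```python
import pandas as pd

def squared_plus1(x):
    return x**2 + 1

s = pd.Series([5, 0, 4], name="A")

# map and apply both call the function once per element of the series
mapped = s.map(squared_plus1)
applied = s.apply(squared_plus1)

# map also accepts a dictionary, used as a lookup table
labels = s.map({5: "five", 0: "zero", 4: "four"})

df = pd.DataFrame({"A": [5, 0, 4], "B": [2, 7, 3]})
# DataFrame.map exists from pandas 2.1 on; fall back to applymap on older versions
elementwise = df.map if hasattr(df, "map") else df.applymap
df_1 = elementwise(squared_plus1)
print(mapped, labels, df_1, sep="\n\n")
```

The ``hasattr`` check is only there so that the sketch runs on both old and recent pandas versions.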

### Binary operations

Things get more interesting for binary operations. Pandas automatically **aligns data according to the index** of a Series:

In [14]:
area = pd.Series({'Alaska': 1_723_337,'Texas': 695_662,'California': 423_967}, name='area')
population = pd.Series({'California': 38_332_521,'Texas': 26_448_193,'New York': 19_651_127}, name='population')

In [15]:
area

Alaska        1723337
Texas          695662
California     423967
Name: area, dtype: int64

In [16]:
population

California    38332521
Texas         26448193
New York      19651127
Name: population, dtype: int64

In [17]:
density = population/area

In [18]:
density

Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64

* The result includes the indices of the two series (Alaska, California, New York, Texas)
* For the common indices (California, Texas), the operation is actually performed
* For the indices that are not in both series (Alaska, New York), the result is ``np.nan``
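
The three points above can be checked directly on the index objects. A small sketch, reusing the two series of this slide (the names ``union`` and ``common`` are illustrative):

```python
import pandas as pd

area = pd.Series({'Alaska': 1_723_337, 'Texas': 695_662, 'California': 423_967}, name='area')
population = pd.Series({'California': 38_332_521, 'Texas': 26_448_193, 'New York': 19_651_127}, name='population')

density = population / area

# labels found in at least one of the two series
union = population.index.union(area.index)
# labels found in both series: the only ones with an actual result
common = population.index.intersection(area.index)

print(sorted(union))                 # all labels of the result
print(sorted(common))                # labels where the division is performed
print(density[density.isna()].index.tolist())  # labels that end up as NaN
```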

### Binary Operations

A similar alignment mechanism applies to dataframes. Data is aligned according to both row and column indices.

In [19]:
df1 = pd.DataFrame(np.random.randint(0, 10, size=(4,3)), columns=["A","B","C"], index=['a','b','c','d'])
df2 = pd.DataFrame(np.random.randint(0, 10, size=(3,3)), columns=["B","C","D"], index=['a','b','c'])
df_sum = df1 + df2
display('df1','df2','df_sum')

Unnamed: 0,A,B,C
a,4,7,6
b,7,3,2
c,8,6,9
d,2,9,2

Unnamed: 0,B,C,D
a,7,2,5
b,0,7,2
c,7,4,2

Unnamed: 0,A,B,C,D
a,,14.0,8.0,
b,,3.0,9.0,
c,,13.0,13.0,
d,,,,


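* The result ``df_sum = df1 + df2`` includes the row and column indices of both ``df1`` and ``df2``
* Columns A and D in ``df_sum`` are ``np.nan``, since each of them appears in only one of the two dataframes
* Row ``d`` in ``df_sum`` is ``np.nan``, since it is missing from ``df2``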

The binary operation is applied to all the elements where it makes sense to do it.
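
When ``np.nan`` is not the desired result for labels found in only one dataframe, the method form ``add`` with the ``fill_value`` option is one possibility. A minimal sketch with the values shown above; using 0 as fill value is an assumption that only makes sense when a missing entry can be read as "nothing to add":

```python
import pandas as pd

df1 = pd.DataFrame([[4, 7, 6], [7, 3, 2], [8, 6, 9], [2, 9, 2]],
                   columns=["A", "B", "C"], index=["a", "b", "c", "d"])
df2 = pd.DataFrame([[7, 2, 5], [0, 7, 2], [7, 4, 2]],
                   columns=["B", "C", "D"], index=["a", "b", "c"])

df_sum = df1 + df2                  # NaN wherever one operand is missing
filled = df1.add(df2, fill_value=0) # a missing operand is treated as 0

# an element missing in BOTH dataframes (row d, column D) is still NaN
print(filled)
```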

### Add a new column

We can add a new column to an existing dataframe using the dictionary-like syntax: ``data["key"] = 1D array_like``. <br/>
The 1D array_like may be a series obtained with unary/binary operations on existing columns.

In [20]:
data = pd.DataFrame(np.random.randn(2, 2), columns=["X", "Y"])
data

Unnamed: 0,X,Y
0,1.049488,-1.381459
1,1.073409,0.468175


In [21]:
data["R"] = data["X"]**2 + data["Y"]**2 # assign a pd.Series to data["R"] using dictionary-like syntax
data

Unnamed: 0,X,Y,R
0,1.049488,-1.381459,3.009855
1,1.073409,0.468175,1.371396


In [22]:
data["A"] = ["a", "b"] # assign a list to a new column
data["B"] = np.arange(2) # assign a numpy array to a new column
data

Unnamed: 0,X,Y,R,A,B
0,1.049488,-1.381459,3.009855,a,0
1,1.073409,0.468175,1.371396,b,1


We can specify the position of the new column using the ``insert`` dataframe method:

In [23]:
data.insert(3, "W", data["X"]/data["R"]) # insert in position 3 (fourth column), with name W
data

Unnamed: 0,X,Y,R,W,A,B
0,1.049488,-1.381459,3.009855,0.348684,a,0
1,1.073409,0.468175,1.371396,0.782713,b,1


### Aggregation methods for series

* Statistical operations such as ``mean``, ``min``, ``max``, ``std``, are built-in methods of ``pd.Series``.

In [24]:
population = pd.Series({'Bellinzona': 17_744,'Lugano': 62_615,'Mendrisio': 11_554,'Stabio': 4_510,'Lausanne': 140_000,'Bern': 133_115}, name='population')

In [25]:
population.mean(),population.min(),population.max(),population.std()

(61589.666666666664, 4510, 140000, 61561.68250029126)

The methods above are **aggregations**: they transform a sequence (``pd.Series``) to a scalar (``float`` or ``int``). <br/>


* The series methods ``sum`` and ``count`` are other common and useful aggregations. 

In [26]:
population.count(), population.sum() # number of cities,sum of the population of the cities

(6, 369538)

### Aggregation methods for dataframes
The aggregations ``mean``, ``min``, ``max``, ``std``, ``count``, ``sum`` are also built-in for dataframes. They are applied either row-wise or column-wise.

In [27]:
df = pd.DataFrame(np.random.randint(0, 10, size=(3, 2)), columns = ["A", "B"], index=['a', 'b', 'c'])
df

Unnamed: 0,A,B
a,7,8
b,7,8
c,1,9


By default, the operations are applied for each column. The aggregation is done over the elements of the rows:

In [28]:
col_sum = df.sum() # df.sum(axis="index") by default -> compute sum for each column
col_sum

A    15
B    25
dtype: int64

Thus, ``col_sum["A"]`` is the sum of the elements of column A, computed over rows a, b, c.

By specifying the option ``axis="columns"``, the operations are computed for each row, over the elements of the columns:

In [29]:
row_sum = df.sum(axis="columns")
row_sum

a    15
b    15
c    10
dtype: int64

Thus, ``row_sum["a"]`` is the sum of row a, computed over columns A, B.
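
The ``axis`` option names the direction that is collapsed by the aggregation. A short sketch with the same numbers as above, where the two sums are also recomputed by hand (``col_sum_A`` and ``row_sum_a`` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({"A": [7, 7, 1], "B": [8, 8, 9]}, index=["a", "b", "c"])

col_sum = df.sum(axis="index")    # same as df.sum() and df.sum(axis=0): one value per column
row_sum = df.sum(axis="columns")  # same as df.sum(axis=1): one value per row

# the same numbers, computed element by element
col_sum_A = df.loc["a", "A"] + df.loc["b", "A"] + df.loc["c", "A"]
row_sum_a = df.loc["a", "A"] + df.loc["a", "B"]
print(col_sum_A == col_sum["A"], row_sum_a == row_sum["a"])
```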

### The describe method

The ``describe`` dataframe method summarizes several **column statistics** and returns a dataframe. It is a useful tool for preliminary **exploratory analysis**.

In [30]:
data = pd.DataFrame(np.random.randn(4, 2), columns=["X", "Y"])
data["Z"] = ["a", "b", "c", "d"]
data

Unnamed: 0,X,Y,Z
0,-0.085843,-0.275827,a
1,-0.733603,-0.142095,b
2,-0.737188,0.856779,c
3,-0.735771,-0.288423,d


In [31]:
data.describe() # dataframe summary statistics

Unnamed: 0,X,Y
count,4.0,4.0
mean,-0.573101,0.037609
std,0.324842,0.550113
min,-0.737188,-0.288423
25%,-0.736125,-0.278976
50%,-0.734687,-0.208961
75%,-0.571663,0.107624
max,-0.085843,0.856779


``describe`` is also available for a series:

In [32]:
data['Z'].describe() # series summary statistics

count     4
unique    4
top       a
freq      1
Name: Z, dtype: object

NOTE: for a dataframe with mixed column types, ``describe`` summarizes the numeric columns only by default. Non-numeric columns are skipped unless requested with the ``include`` option.
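
To see the non-numeric columns in the summary as well, ``describe`` can be called with ``include="all"``. A minimal sketch on hand-written data (the values are made up for illustration):

```python
import pandas as pd

data = pd.DataFrame({"X": [0.5, -1.0, 2.0, 0.0], "Z": ["a", "b", "c", "d"]})

numeric_summary = data.describe()             # default: numeric columns only
full_summary = data.describe(include="all")   # every column

# statistics that do not apply to a column (e.g. the mean of Z) are NaN
print(full_summary)
```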

### Custom row- and column-wise dataframe operations


Custom row-wise or column-wise operations can be implemented via the dataframe ``apply`` method. 
It may be used to compute custom statistics.

In [33]:
df = pd.DataFrame(np.random.rand(3, 2), columns=["A", "B"], index=['a', 'b', 'c'])
df

Unnamed: 0,A,B
a,0.291669,0.605759
b,0.826059,0.403415
c,0.425681,0.533858


In [34]:
def numerical_range(x):
    return x.max() - x.min()

In [35]:
df.apply(numerical_range, axis="index") # numerical range, by columns
# numerical_range is an aggregation function. Thus, when applied to a dataframe, it will return a Series...

A    0.534391
B    0.202343
dtype: float64

In [36]:
df.apply(numerical_range, axis="columns") # numerical range,by rows

a    0.314090
b    0.422644
c    0.108177
dtype: float64

Note: the function ``numerical_range`` expects a ``pd.Series`` corresponding to a column (or a row) of the dataframe. 
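
What the function actually receives can be inspected by returning a description of its argument. A small sketch (``describe_input`` is a hypothetical helper, and the numbers are made up):

```python
import pandas as pd

df = pd.DataFrame({"A": [0.2, 0.8, 0.4], "B": [0.6, 0.4, 0.5]}, index=["a", "b", "c"])

def describe_input(x):
    # x is a pd.Series: report its type and the labels it carries
    return f"{type(x).__name__} labelled {list(x.index)}"

by_column = df.apply(describe_input, axis="index")    # one call per column
by_row = df.apply(describe_input, axis="columns")     # one call per row
print(by_column["A"])
print(by_row["a"])
```

With ``axis="index"`` each call sees a whole column (labelled by the row index); with ``axis="columns"`` each call sees a whole row (labelled by the column names).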


Anonymous *lambda* functions can also be used:

In [37]:
df.apply(lambda x: x["a"] + x["b"] - x["c"]) 

A    0.692047
B    0.475316
dtype: float64

In [38]:
df.apply(lambda x: np.sqrt(x["A"]**2 + x["B"]**2),axis="columns") # can also assign it to new column: df["C"] = df.apply(lambda row: np.sqrt(row["A"]**2 + row["B"]**2),axis="columns")

a    0.672320
b    0.919303
c    0.682795
dtype: float64

### Sorting

Dataframes may be sorted by column values...

In [39]:
df = pd.DataFrame(np.random.randint(0, 10, size=(4, 2)), columns=["A", "B"], index=['a', 'd', 'c', 'b'])
df_sort_A_asc = df.sort_values(by=["A"]) # default: ascending order (from small to large)
df_sort_A_dsc = df.sort_values(by=["A"], ascending=False) # (from large to small)
display('df', 'df_sort_A_asc', 'df_sort_A_dsc')

Unnamed: 0,A,B
a,1,7
d,6,1
c,0,0
b,3,9

Unnamed: 0,A,B
c,0,0
a,1,7
b,3,9
d,6,1

Unnamed: 0,A,B
d,6,1
b,3,9
a,1,7
c,0,0


...or by index value

In [40]:
df.sort_index() # sort dataframe according to its index (alphabetical order)

Unnamed: 0,A,B
a,1,7
b,3,9
c,0,0
d,6,1


### Inplace operations

Sorting may be performed in place by setting the ``inplace`` option to ``True``. Then:
 * The original dataframe **is modified**
 * The return value is ``None``

In [41]:
df = pd.DataFrame(np.random.randint(0, 10, size=(6, 2)), columns=["A", "B"])
df

Unnamed: 0,A,B
0,8,8
1,1,1
2,2,4
3,5,4
4,8,7
5,8,7


In [42]:
ret_val = df.sort_values(by=["A"], inplace=True)
df

Unnamed: 0,A,B
1,1,1
2,2,4
3,5,4
0,8,8
4,8,7
5,8,7


In [43]:
ret_val is None

True

* Several pandas methods (typically data transformations) have an ``inplace`` option. Look it up in the documentation!
* On large datasets, inplace operations may be faster or save memory, although this is not guaranteed: pandas often builds a new object internally anyway.
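
For comparison, the two styles side by side on a small hand-written dataframe (``df_sorted`` and ``unchanged`` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({"A": [8, 1, 2], "B": [8, 1, 4]})

# without inplace: a new sorted dataframe is returned, df is left untouched
df_sorted = df.sort_values(by=["A"])
unchanged = df["A"].tolist()

# with inplace=True: df itself is reordered and the call returns None
ret_val = df.sort_values(by=["A"], inplace=True)
print(unchanged, df["A"].tolist(), ret_val)
```

A frequent mistake is writing ``df = df.sort_values(by=["A"], inplace=True)``, which overwrites ``df`` with ``None``.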

### Counting

Common tasks: check distinct values of a series, count their occurrences.

In [44]:
ser = pd.Series(np.array([0, 0, 0, 1, 1, 0, 0, 1, 0, 0]), name="bits")
ser

0    0
1    0
2    0
3    1
4    1
5    0
6    0
7    1
8    0
9    0
Name: bits, dtype: int64

In [45]:
ser.count() # the series contains 10 elements

10

In [46]:
ser.unique() # the series contains the distinct values 0 and 1

array([0, 1])

In [47]:
ser.value_counts() # the value 0 occurs 7 times,the value 1 occurs 3 times

bits
0    7
1    3
Name: count, dtype: int64

Useful to understand whether a variable is numerical (in principle, infinite number of possible values) or categorical (finite possible values).
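
One way to act on this remark is to compare the number of distinct values with the number of elements. A minimal sketch; the series ``noise`` is made-up random data, and the idea that "few distinct values" suggests a categorical variable is a rule of thumb, not a strict test:

```python
import numpy as np
import pandas as pd

bits = pd.Series([0, 0, 0, 1, 1, 0, 0, 1, 0, 0], name="bits")
rng = np.random.default_rng(0)
noise = pd.Series(rng.normal(size=10), name="noise")

n_bits = bits.nunique()     # few distinct values: looks categorical
n_noise = noise.nunique()   # as many distinct values as elements: looks numerical

shares = bits.value_counts(normalize=True)  # relative frequencies instead of counts
print(n_bits, n_noise)
print(shares)
```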

### String manipulation

Series of string objects have a ``str`` attribute containing useful methods for string manipulation.

In [48]:
df = pd.DataFrame(np.random.randint(1, 10, size=(3, 4)), columns=["Area", "Population", "Latitude", "Longitude"])
df.insert(0, "City", ["Lugano", "Geneva", "Zurich"]) # insert column at position 0
df

Unnamed: 0,City,Area,Population,Latitude,Longitude
0,Lugano,7,5,9,3
1,Geneva,5,9,6,4
2,Zurich,8,6,7,2


In [49]:
df['City'].str

<pandas.core.strings.accessor.StringMethods at 0x16e857a30>

Example: convert to uppercase/to lowercase

In [50]:
df['City'] = df['City'].str.upper() # to uppercase
df

Unnamed: 0,City,Area,Population,Latitude,Longitude
0,LUGANO,7,5,9,3
1,GENEVA,5,9,6,4
2,ZURICH,8,6,7,2


In [51]:
df['City'] = df['City'].str.lower() # to lowercase
df

Unnamed: 0,City,Area,Population,Latitude,Longitude
0,lugano,7,5,9,3
1,geneva,5,9,6,4
2,zurich,8,6,7,2


Many more methods available. Look up in the documentation!
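
A few more ``str`` methods that are handy for cleaning, chained on a deliberately messy series (the input values are made up for illustration):

```python
import pandas as pd

city = pd.Series(["  Lugano", "geneva ", "ZURICH"], name="City")

clean = city.str.strip().str.title()   # remove surrounding spaces, then capitalise
length = clean.str.len()               # number of characters of each string
has_u = clean.str.contains("u")        # boolean mask, usable for filtering
short = clean.str.slice(0, 3)          # first three characters
print(clean.tolist(), short.tolist())
```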

### Date manipulation

Dates may be simply represented as strings:

In [52]:
df = pd.DataFrame({"date": ["01/12/2020", "02/01/2021", "03/02/2021", "04/03/2021"], "val": np.random.randn(4)})
df

Unnamed: 0,date,val
0,01/12/2020,1.161897
1,02/01/2021,0.300429
2,03/02/2021,0.967956
3,04/03/2021,-1.492499


However, it is convenient to represent dates with a specific data type. Pandas can parse several date formats. A format string may be given as argument.

In [53]:
df["date"] = pd.to_datetime(df["date"], dayfirst=True)  # specify that the first field is the day
#df["date"] = pd.to_datetime(df["date"], format='%d/%m/%Y')  # specify a format
df["date"]

0   2020-12-01
1   2021-01-02
2   2021-02-03
3   2021-03-04
Name: date, dtype: datetime64[ns]
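
Why the format matters: the strings above are ambiguous, and the same text gives different dates depending on which field is read as the day. A short sketch using the ``format`` argument explicitly (``raw``, ``day_first`` and ``month_first`` are illustrative names):

```python
import pandas as pd

raw = pd.Series(["01/12/2020", "02/01/2021", "03/02/2021", "04/03/2021"])

day_first = pd.to_datetime(raw, format="%d/%m/%Y")    # first field read as the day
month_first = pd.to_datetime(raw, format="%m/%d/%Y")  # first field read as the month

print(day_first.dt.month.tolist())
print(month_first.dt.month.tolist())
```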

A datetime series has a ``dt`` attribute with useful tools for datetime handling:

In [54]:
df["date"].dt

<pandas.core.indexes.accessors.DatetimeProperties object at 0x107ba17b0>

In [55]:
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day
df

Unnamed: 0,date,val,year,month,day
0,2020-12-01,1.161897,2020,12,1
1,2021-01-02,0.300429,2021,1,2
2,2021-02-03,0.967956,2021,2,3
3,2021-03-04,-1.492499,2021,3,4


# Cleaning


Real data are often messy/inconsistent. 

A good portion of a data scientist's work is to clean and prepare data!

### Missing data

Missing values are generally represented with ``np.nan`` or ``None``.

In [56]:
df = pd.DataFrame(np.random.randn(6,5), columns=["A","B","C","D","E"])
df.iloc[0, 0] = np.nan
df.iloc[1, 1] = np.nan
df.iloc[2, 3] = np.nan
df.iloc[3, 3] = np.nan
df

Unnamed: 0,A,B,C,D,E
0,,0.175815,-0.357037,0.259905,-0.809283
1,-0.627117,,0.73981,-0.640156,0.515435
2,-1.08177,-0.766112,-0.877725,,0.295293
3,0.063231,0.422872,0.312515,,0.395372
4,1.244582,-0.123171,0.622985,-0.186772,0.211855
5,-0.075911,1.721565,0.20671,0.885027,0.105999


We can identify missing values using the ``isna`` method:

In [57]:
df.isna() # or df.isnull()

Unnamed: 0,A,B,C,D,E
0,True,False,False,False,False
1,False,True,False,False,False
2,False,False,False,True,False
3,False,False,False,True,False
4,False,False,False,False,False
5,False,False,False,False,False


The ``isna`` operation is often combined with the ``any`` aggregation:

In [58]:
df.isna().any() # columns where at least one element is missing
# df.isna().any(axis="columns") # rows where at least one element is missing

A     True
B     True
C    False
D     True
E    False
dtype: bool

### Missing data: counting columns/rows with missing values

As in the previous slide: let us look for columns with missing values

In [59]:
df.isna().any() # all columns with missing values

A     True
B     True
C    False
D     True
E    False
dtype: bool

Then, 3 columns (A, B, D) contain missing values. Two columns (C, E) have no missing values.


If we had, say, 100 columns, it would be tedious to count them by hand! Solution: use ``value_counts``!

In [60]:
df.isna().any().value_counts() # count columns with missing/no missing values

True     3
False    2
Name: count, dtype: int64

We can use the same trick for rows:

In [61]:
df.isna().any(axis="columns")

0     True
1     True
2     True
3     True
4    False
5    False
dtype: bool

In [62]:
df.isna().any(axis="columns").value_counts()

True     4
False    2
Name: count, dtype: int64

Then, 4 rows contain missing values, 2 rows have no missing values.
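
A related trick, shown here as a sketch on a dataframe with the same missing-value pattern: since ``True`` counts as 1, ``sum`` applied to the boolean mask gives the number of missing values per column or per row (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((6, 5)), columns=["A", "B", "C", "D", "E"])
df.iloc[0, 0] = np.nan
df.iloc[1, 1] = np.nan
df.iloc[2, 3] = np.nan
df.iloc[3, 3] = np.nan

missing_per_column = df.isna().sum()                 # how many NaNs in each column
missing_per_row = df.isna().sum(axis="columns")      # how many NaNs in each row
n_incomplete_columns = int(df.isna().any().sum())    # columns with at least one NaN
n_incomplete_rows = int(df.isna().any(axis="columns").sum())
print(missing_per_column)
```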

### Missing data handling

Several data analysis/machine learning models do not work with missing values. We may have to get rid of them!

* Sometimes the "right" thing to do is to drop incomplete rows:

In [63]:
df.dropna() # only keeps complete rows, equivalent to df.dropna(axis="index")

Unnamed: 0,A,B,C,D,E
4,1.244582,-0.123171,0.622985,-0.186772,0.211855
5,-0.075911,1.721565,0.20671,0.885027,0.105999


This makes sense when there are just a few rows with missing values.

* Sometimes the "right" thing to do is to drop incomplete columns:

In [64]:
df.dropna(axis="columns") # only keeps complete columns

Unnamed: 0,C,E
0,-0.357037,-0.809283
1,0.73981,0.515435
2,-0.877725,0.295293
3,0.312515,0.395372
4,0.622985,0.211855
5,0.20671,0.105999


* Sometimes we may want to replace ``np.nan`` with a numeric value:

In [65]:
df.fillna(0.0) # replace all np.nans with 0.0

Unnamed: 0,A,B,C,D,E
0,0.0,0.175815,-0.357037,0.259905,-0.809283
1,-0.627117,0.0,0.73981,-0.640156,0.515435
2,-1.08177,-0.766112,-0.877725,0.0,0.295293
3,0.063231,0.422872,0.312515,0.0,0.395372
4,1.244582,-0.123171,0.622985,-0.186772,0.211855
5,-0.075911,1.721565,0.20671,0.885027,0.105999


This makes sense if 0.0 is a reasonable default value for missing information. Not always the case!

### Missing data handling cont'd
Tailored solutions for each case may be required.

In [66]:
df

Unnamed: 0,A,B,C,D,E
0,,0.175815,-0.357037,0.259905,-0.809283
1,-0.627117,,0.73981,-0.640156,0.515435
2,-1.08177,-0.766112,-0.877725,,0.295293
3,0.063231,0.422872,0.312515,,0.395372
4,1.244582,-0.123171,0.622985,-0.186772,0.211855
5,-0.075911,1.721565,0.20671,0.885027,0.105999


We may want to replace missing values with the column mean, with the column median, or with some kind of interpolation technique.

In [67]:
df['A'] = df['A'].fillna(df['A'].mean()) # for column A, replace missing values with the column mean
df['B'] = df['B'].fillna(df['B'].median()) # for column B, replace missing values with the column median
df['D'] = df['D'].ffill() # for column D, forward-fill with the previous valid value (fillna(method='ffill') is deprecated)


In [68]:
df

Unnamed: 0,A,B,C,D,E
0,-0.095397,0.175815,-0.357037,0.259905,-0.809283
1,-0.627117,0.175815,0.73981,-0.640156,0.515435
2,-1.08177,-0.766112,-0.877725,-0.640156,0.295293
3,0.063231,0.422872,0.312515,-0.640156,0.395372
4,1.244582,-0.123171,0.622985,-0.186772,0.211855
5,-0.075911,1.721565,0.20671,0.885027,0.105999


There is no general rule for handling missing data. For instance, forward-fill interpolation makes sense for time series.
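
To make the difference between forward-fill and interpolation concrete, here is a minimal sketch on a short series with a gap (the values are made up; ``interpolate`` is used with its default linear method):

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, np.nan, np.nan, 4.0, 5.0])

forward = ser.ffill()        # repeat the last valid value: 1, 1, 1, 4, 5
linear = ser.interpolate()   # straight line between the valid neighbours: 1, 2, 3, 4, 5
print(forward.tolist(), linear.tolist())
```

Linear interpolation assumes that the samples are equally spaced, which is again a typical time-series setting.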

### Replacing Values

Sometimes a certain value, e.g. ``-1``, is used to code missing information in the dataset. The ``replace`` method of ``pd.Series`` comes in handy.

In [69]:
ser = pd.Series(np.random.rand(10))
ser[4] = -1
ser[8] = -1
ser

0    0.076572
1    0.885173
2    0.799358
3    0.957936
4   -1.000000
5    0.754726
6    0.539342
7    0.800408
8   -1.000000
9    0.231572
dtype: float64

In [70]:
ser.replace(-1, np.nan)

0    0.076572
1    0.885173
2    0.799358
3    0.957936
4         NaN
5    0.754726
6    0.539342
7    0.800408
8         NaN
9    0.231572
dtype: float64
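
When several codes are used for missing information, ``replace`` also accepts a list of values. A small sketch; the sentinel ``-999`` is a made-up example, not taken from a real dataset:

```python
import numpy as np
import pandas as pd

ser = pd.Series([0.5, -1, 0.2, -999, 0.9])

cleaned = ser.replace([-1, -999], np.nan)   # every listed value becomes NaN
n_missing = int(cleaned.isna().sum())
print(cleaned)
```

After the replacement, the missing-data tools of the previous slides (``isna``, ``dropna``, ``fillna``) apply as usual.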

### Dropping rows/columns

It is common to drop one or more columns from a dataframe:

In [71]:
df = pd.DataFrame(np.random.randint(1,10,size=(3,4)),columns=["Area","Population","Latitude","Longitude"],index=["Lugano","Geneva","Zurich"])
df

Unnamed: 0,Area,Population,Latitude,Longitude
Lugano,3,6,6,8
Geneva,8,7,8,7
Zurich,4,3,4,5


In [72]:
df.drop(["Population"], axis="columns") # drop also has the inplace option. 

Unnamed: 0,Area,Latitude,Longitude
Lugano,3,6,8
Geneva,8,8,7
Zurich,4,4,5


In [73]:
df.drop(["Population","Area"], axis="columns")

Unnamed: 0,Latitude,Longitude
Lugano,6,8
Geneva,8,7
Zurich,4,5


With a similar syntax, we can drop rows:

In [74]:
df.drop("Zurich") # or df.drop("Zurich",axis="rows") 

Unnamed: 0,Area,Population,Latitude,Longitude
Lugano,3,6,6,8
Geneva,8,7,8,7


### Renaming rows/columns

A common cleaning task is to fix typos in row and column names.

We can use the ``rename`` method of ``pd.DataFrame``:

In [75]:
df = pd.DataFrame(np.random.randint(1,10,size=(3,4)), columns=["Area","Popluation","Latitude","Longitude"], index=["Lugano","Geneva","Zuric"])
df

Unnamed: 0,Area,Popluation,Latitude,Longitude
Lugano,8,1,2,4
Geneva,3,1,1,5
Zuric,3,5,2,7


In [76]:
df = df.rename(columns={"Popluation": "Population"}) # rename some column names
df = df.rename(index={"Zuric": "Zurich"}) # rename some index names
df

Unnamed: 0,Area,Population,Latitude,Longitude
Lugano,8,1,2,4
Geneva,3,1,1,5
Zurich,3,5,2,7


In [77]:
df.index.name = "City" # the index itself may have a name...

In [78]:
df

City,Area,Population,Latitude,Longitude
Lugano,8,1,2,4
Geneva,3,1,1,5
Zurich,3,5,2,7


### Renaming rows/columns

Alternatively, we can re-assign the row and column index of the dataframe: 

In [79]:
df = pd.DataFrame(np.random.randint(1, 10, size=(3 , 4)),
                  columns=["Area","Popluation","Latitude","Longitude"],
                  index=["Lugano","Geneva","Zuric"])
df

Unnamed: 0,Area,Popluation,Latitude,Longitude
Lugano,9,4,4,3
Geneva,7,5,1,9
Zuric,8,7,7,8


In [80]:
df.columns

Index(['Area', 'Popluation', 'Latitude', 'Longitude'], dtype='object')

In [81]:
df.columns = ['Area','Population','Latitude','Longitude']

In [82]:
df.index = ["Lugano","Geneva","Zurich"]

In [83]:
df

Unnamed: 0,Area,Population,Latitude,Longitude
Lugano,9,4,4,3
Geneva,7,5,1,9
Zurich,8,7,7,8


### Dropping duplicates

Duplicate rows may be suspicious. We can easily identify and remove them.

In [84]:
df = pd.DataFrame(np.array([[0,0], [0,1], [1,0], [1,1], [0,0]]), columns=["A", "B"])
df

Unnamed: 0,A,B
0,0,0
1,0,1
2,1,0
3,1,1
4,0,0


In [85]:
df.duplicated()

0    False
1    False
2    False
3    False
4     True
dtype: bool

In [86]:
df.drop_duplicates()

Unnamed: 0,A,B
0,0,0
1,0,1
2,1,0
3,1,1
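
Both ``duplicated`` and ``drop_duplicates`` have options to control which copies are flagged and which columns are compared. A minimal sketch on the same data (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 0, 1, 1, 0], "B": [0, 1, 0, 1, 0]})

all_copies = df.duplicated(keep=False)        # flag every copy, including the first one
keep_last = df.drop_duplicates(keep="last")   # keep the last occurrence instead of the first
unique_A = df.drop_duplicates(subset=["A"])   # compare column A only
print(all_copies.tolist(), keep_last.index.tolist(), unique_A.index.tolist())
```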
