## Cleaning a CSV File Using Pandas

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('data-cleaning-pandas.csv')
data

Unnamed: 0,Index,Age,Salary,Rating,Location,Established,Easy Apply
0,0,44.0,$44k-$99k,5.4,"India,In",1999,TRUE
1,1,66.0,$55k-$66k,3.5,"New York,Ny",2002,TRUE
2,2,,$77k-$89k,-1.0,"New York,Ny",-1,-1
3,3,64.0,$44k-$99k,4.4,India In,1988,-1
4,4,25.0,$44k-$99k,6.4,Australia Aus,2002,-1
5,5,44.0,$77k-$89k,1.4,"India,In",1999,TRUE
6,6,21.0,$44k-$99k,0.0,"New York,Ny",-1,-1
7,7,44.0,$44k-$99k,-1.0,Australia Aus,-1,-1
8,8,35.0,$44k-$99k,5.4,"New York,Ny",-1,-1
9,9,22.0,$44k-$99k,7.7,"India,In",-1,TRUE


In [220]:
#are there duplicate rows?

data.duplicated().sum()

0

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Index        29 non-null     int64  
 1   Age          22 non-null     float64
 2   Salary       29 non-null     object 
 3   Rating       28 non-null     float64
 4   Location     29 non-null     object 
 5   Established  29 non-null     int64  
 6   Easy Apply   29 non-null     object 
dtypes: float64(2), int64(2), object(3)
memory usage: 1.7+ KB


`data.info()` gives an overview of the data:
- The Index column is redundant because the pandas DataFrame already provides index values.
- The Age and Rating columns have missing values.
- The Salary and Easy Apply columns are stored as `object` and need to be converted to more suitable types.
- The Rating, Established and Easy Apply columns contain `-1` placeholder values that need to be corrected.
- The Location column needs to be split in two so that the full location name and its code are stored separately.
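
The checks listed above can also be done programmatically. The following is a minimal sketch on a small made-up frame (the `sample` data is illustrative, not the real file); it counts true nulls and `-1` placeholders per column.

```python
import pandas as pd

# Toy frame standing in for the raw data (values are illustrative only)
sample = pd.DataFrame({
    "Age": [44.0, None, 64.0],
    "Rating": [5.4, -1.0, 4.4],
    "Established": [1999, -1, 1988],
    "Easy Apply": ["TRUE", "-1", "-1"],
})

# Count true nulls per column
null_counts = sample.isna().sum()

# Count -1 placeholders per column; comparing as strings catches
# the integer -1, the float -1.0 and the text '-1' alike
placeholder_counts = sample.astype(str).isin(["-1", "-1.0"]).sum()

print(null_counts)
print(placeholder_counts)
```

Counting both kinds of gap up front helps decide which columns need a fill strategy and which only need a type conversion.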

We create a copy of the DataFrame at every step of the cleaning process, so that earlier versions stay available and mistakes can be undone without losing progress.

The data cleaning process will be tackled column-wise from left to right.

## Step 1: Index Column

The index column will be dropped due to its redundancy.

In [3]:
df1 = data.copy()
df1.head()

Unnamed: 0,Index,Age,Salary,Rating,Location,Established,Easy Apply
0,0,44.0,$44k-$99k,5.4,"India,In",1999,TRUE
1,1,66.0,$55k-$66k,3.5,"New York,Ny",2002,TRUE
2,2,,$77k-$89k,-1.0,"New York,Ny",-1,-1
3,3,64.0,$44k-$99k,4.4,India In,1988,-1
4,4,25.0,$44k-$99k,6.4,Australia Aus,2002,-1


In [10]:
del df1['Index']
df1.head()

Unnamed: 0,Age,Salary,Rating,Location,Established,Easy Apply
0,44.0,$44k-$99k,5.4,"India,In",1999,TRUE
1,66.0,$55k-$66k,3.5,"New York,Ny",2002,TRUE
2,,$77k-$89k,-1.0,"New York,Ny",-1,-1
3,64.0,$44k-$99k,4.4,India In,1988,-1
4,25.0,$44k-$99k,6.4,Australia Aus,2002,-1
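
As an aside, `del` modifies the DataFrame in place. A possible alternative, sketched here on a made-up two-row frame (`raw` and `trimmed` are hypothetical names), is `DataFrame.drop`, which returns a new frame and leaves the original untouched.

```python
import pandas as pd

# Illustrative stand-in for the loaded data
raw = pd.DataFrame({"Index": [0, 1], "Age": [44.0, 66.0]})

# drop() returns a new DataFrame; `raw` keeps its Index column
trimmed = raw.drop(columns=["Index"])

print(list(raw.columns))
print(list(trimmed.columns))
```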


## Step 2: Age Column

The Age column has null values which will be filled with measures of central tendency. 

In [11]:
df2 = df1.copy()

In [13]:
#finding the rows where Age is null
df2.loc[df2['Age'].isnull()]

Unnamed: 0,Age,Salary,Rating,Location,Established,Easy Apply
2,,$77k-$89k,-1.0,"New York,Ny",-1,-1
12,,$44k-$99k,0.0,"India,In",1999,-1
17,,$44k-$99k,5.3,"New York,Ny",1943,TRUE
20,,$44k-$99k,5.7,"New York,Ny",1944,TRUE
23,,$44k-$99k,2.4,"New York,Ny",1999,TRUE
26,,$55k-$66k,,"India,In",1934,TRUE
28,,$39k-$88k,3.4,Australia Aus,1932,-1


We can fill the null values in Age with a measure of central tendency computed over the entire Age column, or use the `groupby` function so that other criteria such as Location or Salary are taken into account when computing it.

Since we are using only the pandas library here, we inspect the rows with null values shown above to choose the grouping criterion.

In practice, a plot such as seaborn's `pairplot` can reveal relationships between columns and make it quicker to choose the groupby criterion. In larger datasets, rows with null values can simply be deleted if the percentage of data lost is small.
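
To make the choice concrete, here is a small sketch on made-up data (the `people` frame is an assumption, not the real file). It compares a single global fill value with a per-group fill, and computes the share of rows that dropping nulls would cost. The median is used here purely for illustration; the notebook itself uses the mode.

```python
import pandas as pd

# Illustrative data: two locations, one missing age in each
people = pd.DataFrame({
    "Location": ["India,In", "India,In", "India,In",
                 "New York,Ny", "New York,Ny", "New York,Ny"],
    "Age": [44.0, 22.0, None, 66.0, 35.0, None],
})

# Share of rows that would be lost by dropping null ages
null_share = people["Age"].isna().mean()

# Option A: one global median for every missing value
global_fill = people["Age"].fillna(people["Age"].median())

# Option B: the median within each Location group
group_median = people.groupby("Location")["Age"].transform("median")
group_fill = people["Age"].fillna(group_median)

print(f"null share: {null_share:.0%}")
print(global_fill.tolist())
print(group_fill.tolist())
```

With a global fill both gaps receive the same value, while the grouped fill gives each location its own value.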

In [15]:
#filling null Age values with the mode of each Location group
#(a group whose Age values are all null is left unchanged)

df2['Age'] = df2.groupby('Location')['Age'].transform(lambda x: x.fillna(x.mode().iloc[0]) if not x.mode().empty else x)

In [16]:
df2.head()

Unnamed: 0,Age,Salary,Rating,Location,Established,Easy Apply
0,44.0,$44k-$99k,5.4,"India,In",1999,TRUE
1,66.0,$55k-$66k,3.5,"New York,Ny",2002,TRUE
2,35.0,$77k-$89k,-1.0,"New York,Ny",-1,-1
3,64.0,$44k-$99k,4.4,India In,1988,-1
4,25.0,$44k-$99k,6.4,Australia Aus,2002,-1


In [212]:
#converting Age data type into integer 
df2['Age'] = df2['Age'].astype(int)
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Age          29 non-null     int32  
 1   Salary       29 non-null     object 
 2   Rating       28 non-null     float64
 3   Location     29 non-null     object 
 4   Established  29 non-null     int64  
 5   Easy Apply   29 non-null     object 
dtypes: float64(1), int32(1), int64(1), object(3)
memory usage: 1.4+ KB
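
The conversion above works only because every null in Age has already been filled. If some gaps had to remain, one option is pandas' nullable `Int64` dtype. A minimal sketch on made-up values:

```python
import pandas as pd

# Illustrative ages with one gap left unfilled
ages = pd.Series([44.0, None, 64.0])

# astype(int) would raise here because NaN has no integer representation;
# the nullable 'Int64' dtype stores the gap as <NA> instead
nullable_ages = ages.astype("Int64")

print(nullable_ages)
```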


## Step 3: Salary Column

The Salary column needs to be split into the lower and upper ends of the salary range.

In [18]:
df3 = df2.copy()

In [43]:
salary_split = df3['Salary'].str.split('-', expand=True)
salary_split.head()


Unnamed: 0,0,1
0,$44k,$99k
1,$55k,$66k
2,$77k,$89k
3,$44k,$99k
4,$44k,$99k


In [44]:
# extract(r'(\d+)') pulls the run of digits out of each cell
# (a raw string avoids Python's invalid-escape warning for \d)

df3['Lower_Salary_Range'] = salary_split[0].str.extract(r'(\d+)').astype(float) * 1000
df3.head()

Unnamed: 0,Age,Salary,Rating,Location,Established,Easy Apply,Lower_Salary_Range
0,44.0,$44k-$99k,5.4,"India,In",1999,TRUE,44000.0
1,66.0,$55k-$66k,3.5,"New York,Ny",2002,TRUE,55000.0
2,35.0,$77k-$89k,-1.0,"New York,Ny",-1,-1,77000.0
3,64.0,$44k-$99k,4.4,India In,1988,-1,44000.0
4,25.0,$44k-$99k,6.4,Australia Aus,2002,-1,44000.0


In [46]:
df3['Upper_Salary_Range'] = salary_split[1].str.extract(r'(\d+)').astype(float) * 1000

df3.head()

Unnamed: 0,Age,Salary,Rating,Location,Established,Easy Apply,Lower_Salary_Range,Upper_Salary_Range
0,44.0,$44k-$99k,5.4,"India,In",1999,TRUE,44000.0,99000.0
1,66.0,$55k-$66k,3.5,"New York,Ny",2002,TRUE,55000.0,66000.0
2,35.0,$77k-$89k,-1.0,"New York,Ny",-1,-1,77000.0,89000.0
3,64.0,$44k-$99k,4.4,India In,1988,-1,44000.0,99000.0
4,25.0,$44k-$99k,6.4,Australia Aus,2002,-1,44000.0,99000.0


In [48]:
#deleting the original Salary column 
del df3['Salary']
df3.head()

Unnamed: 0,Age,Rating,Location,Established,Easy Apply,Lower_Salary_Range,Upper_Salary_Range
0,44.0,5.4,"India,In",1999,TRUE,44000.0,99000.0
1,66.0,3.5,"New York,Ny",2002,TRUE,55000.0,66000.0
2,35.0,-1.0,"New York,Ny",-1,-1,77000.0,89000.0
3,64.0,4.4,India In,1988,-1,44000.0,99000.0
4,25.0,6.4,Australia Aus,2002,-1,44000.0,99000.0
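
The split-then-extract steps above could also be collapsed into a single call. The sketch below assumes every value follows the `$<n>k-$<m>k` pattern seen in this dataset; the sample strings are illustrative.

```python
import pandas as pd

# Illustrative salary strings in the same format as the dataset
salary = pd.Series(["$44k-$99k", "$55k-$66k", "$77k-$89k"])

# Named groups become the column names of the extracted frame
pattern = r"\$(?P<Lower_Salary_Range>\d+)k-\$(?P<Upper_Salary_Range>\d+)k"
bounds = salary.str.extract(pattern).astype(float) * 1000

print(bounds)
```

Rows that do not match the pattern come back as NaN in both columns, which makes unexpected formats easy to spot.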


## Step 4: Rating Column

The Rating column has negative and null values, and its values will be rounded to the nearest 0.5 for easier organisation and interpretation.

In [49]:
df4 = df3.copy()

df4.info()

In [53]:
print('rating min:', df4['Rating'].agg('min'))
print('rating max:', df4['Rating'].agg('max'))

rating min: -1.0
rating max: 7.8


In [57]:
print('rating mean:', df4['Rating'].mean())
print('rating median:', df4['Rating'].median())
print('rating mode:', df4['Rating'].mode())

rating mean: 3.528571428571429
rating median: 4.2
rating mode: 0   -1.0
1    5.4
Name: Rating, dtype: float64


In [58]:
#replacing -1 values with the median
#(note: this median is computed while the -1 placeholders are still in the column)

df4['Rating'] = df4['Rating'].replace(-1, df4['Rating'].median())
df4

Unnamed: 0,Age,Rating,Location,Established,Easy Apply,Lower_Salary_Range,Upper_Salary_Range
0,44.0,5.4,"India,In",1999,TRUE,44000.0,99000.0
1,66.0,3.5,"New York,Ny",2002,TRUE,55000.0,66000.0
2,35.0,4.2,"New York,Ny",-1,-1,77000.0,89000.0
3,64.0,4.4,India In,1988,-1,44000.0,99000.0
4,25.0,6.4,Australia Aus,2002,-1,44000.0,99000.0
5,44.0,1.4,"India,In",1999,TRUE,77000.0,89000.0
6,21.0,0.0,"New York,Ny",-1,-1,44000.0,99000.0
7,44.0,4.2,Australia Aus,-1,-1,44000.0,99000.0
8,35.0,5.4,"New York,Ny",-1,-1,44000.0,99000.0
9,22.0,7.7,"India,In",-1,TRUE,44000.0,99000.0
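
Placeholder values pull the median down if they are still present when it is computed. The sketch below uses made-up ratings (not the real column) to show the difference, and a possible variant that masks the placeholders first.

```python
import pandas as pd

# Illustrative ratings with two -1 placeholders
rating = pd.Series([5.4, 3.5, -1.0, 4.4, 6.4, -1.0])

# Median with the placeholders still counted as real values
median_with_placeholders = rating.median()

# Turn the placeholders into NaN first, so median() skips them
masked = rating.mask(rating == -1)
median_without_placeholders = masked.median()

# Fill the gaps with the placeholder-free median
cleaned = masked.fillna(median_without_placeholders)

print(median_with_placeholders, median_without_placeholders)
print(cleaned.tolist())
```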


In [59]:
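#filling null values with the median
#(assigning the result back avoids the chained inplace fillna, which newer pandas versions warn about)

df4['Rating'] = df4['Rating'].fillna(df4['Rating'].median())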
df4.tail()

Unnamed: 0,Age,Rating,Location,Established,Easy Apply,Lower_Salary_Range,Upper_Salary_Range
24,13.0,4.2,"New York,Ny",1987,-1,44000.0,99000.0
25,55.0,0.0,Australia Aus,1980,TRUE,44000.0,99000.0
26,44.0,4.3,"India,In",1934,TRUE,55000.0,66000.0
27,52.0,5.4,"India,In",1935,-1,44000.0,99000.0
28,25.0,3.4,Australia Aus,1932,-1,39000.0,88000.0


In [65]:
#rounding the rating values to the nearest 0.5

df4['Rounded_Rating'] = df4['Rating'].apply(lambda x: round(x * 2) / 2)
df4.head()

Unnamed: 0,Age,Rating,Location,Established,Easy Apply,Lower_Salary_Range,Upper_Salary_Range,Rounded_Rating
0,44.0,5.4,"India,In",1999,TRUE,44000.0,99000.0,5.5
1,66.0,3.5,"New York,Ny",2002,TRUE,55000.0,66000.0,3.5
2,35.0,4.2,"New York,Ny",-1,-1,77000.0,89000.0,4.0
3,64.0,4.4,India In,1988,-1,44000.0,99000.0,4.5
4,25.0,6.4,Australia Aus,2002,-1,44000.0,99000.0,6.5
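
The same rounding can be written without `apply`, using vectorised Series arithmetic. A minimal sketch on the first five ratings shown above:

```python
import pandas as pd

rating = pd.Series([5.4, 3.5, 4.2, 4.4, 6.4])

# Doubling, rounding to a whole number and halving snaps each value to a 0.5 step
rounded = (rating * 2).round() / 2

print(rounded.tolist())
```

Both Python's `round` and `Series.round` round exact halves to the nearest even number, so a doubled value that lands exactly on .5 may round down rather than up.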


## Step 5: Location Column

The Location column needs to be split, and numeric labels need to be added for each region to support later analysis.

In [66]:
df5 = df4.copy()

In [129]:
#splitting the Location column into region name and region code
location_split = df5['Location'].str.split(',', n=1, expand=True)
location_split

Unnamed: 0,0,1
0,India,In
1,New York,Ny
2,New York,Ny
3,India In,
4,Australia Aus,
5,India,In
6,New York,Ny
7,Australia Aus,
8,New York,Ny
9,India,In


In [130]:
#finding the cells that could not be split because they contain no comma
location_split.loc[location_split[1].isnull()]

Unnamed: 0,0,1
3,India In,
4,Australia Aus,
7,Australia Aus,
13,Australia Aus,
14,Australia Aus,
15,Australia Aus,
25,Australia Aus,
28,Australia Aus,


In [131]:
#Creating variables to hold rows where column 0 ends with 'Aus' or 'In' and column 1 is null

aus_temp = (location_split[0].str.endswith('Aus')) & (location_split[1].isnull())
in_temp = (location_split[0].str.endswith('In')) & (location_split[1].isnull())

In [132]:
# inserting corresponding region code in column index 1
location_split.loc[aus_temp, 1] = 'Aus'

location_split.loc[in_temp, 1] = 'In'

location_split

Unnamed: 0,0,1
0,India,In
1,New York,Ny
2,New York,Ny
3,India In,In
4,Australia Aus,Aus
5,India,In
6,New York,Ny
7,Australia Aus,Aus
8,New York,Ny
9,India,In


In [133]:
#selecting remaining rows that end with region code 
location_split[location_split[0].str.endswith(('Aus', 'In'))]

Unnamed: 0,0,1
3,India In,In
4,Australia Aus,Aus
7,Australia Aus,Aus
13,Australia Aus,Aus
14,Australia Aus,Aus
15,Australia Aus,Aus
25,Australia Aus,Aus
28,Australia Aus,Aus


In [134]:
#removing the region code from the remaining rows
#(rstrip also removes the space that separated the name from the code)

location_split.loc[location_split[0].str.endswith('Aus'), 0] = location_split.loc[location_split[0].str.endswith('Aus'), 0].str[:-3].str.rstrip()

location_split.loc[location_split[0].str.endswith('In'), 0] = location_split.loc[location_split[0].str.endswith('In'), 0].str[:-2].str.rstrip()

location_split

Unnamed: 0,0,1
0,India,In
1,New York,Ny
2,New York,Ny
3,India,In
4,Australia,Aus
5,India,In
6,New York,Ny
7,Australia,Aus
8,New York,Ny
9,India,In


In [135]:
#converting the region code to upper case for visibility

location_split[1] = location_split[1].str.upper()
location_split.head()

Unnamed: 0,0,1
0,India,IN
1,New York,NY
2,New York,NY
3,India,IN
4,Australia,AUS
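
The split, patch and trim steps above could also be done in one pass with a regular expression that accepts either a comma or a space before the code. This is a sketch under the assumption that the code is always the last word of the cell; the sample values mirror the formats seen in the dataset.

```python
import pandas as pd

# The two formats seen in the Location column
location = pd.Series(["India,In", "New York,Ny", "India In", "Australia Aus"])

# Region: everything up to the last separator; Region_Code: the final word
pattern = r"^(?P<Region>.+?)[,\s]+(?P<Region_Code>\w+)$"
parts = location.str.extract(pattern)
parts["Region_Code"] = parts["Region_Code"].str.upper()

print(parts)
```

A cell with no separator at all (for example a bare region name) does not match and yields NaN in both columns, so it would still need separate handling.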


In [139]:
#merging the location_split df with the working df (version 2)

df4_v2 = pd.merge(df4, location_split, left_index=True, right_index=True, how='left')

df4_v2 = df4_v2.rename(columns={1: 'Region_Code', 0: 'Region'})

df4_v2


Unnamed: 0,Age,Rating,Location,Established,Easy Apply,Lower_Salary_Range,Upper_Salary_Range,Rounded_Rating,Region,Region_Code
0,44.0,5.4,"India,In",1999,TRUE,44000.0,99000.0,5.5,India,IN
1,66.0,3.5,"New York,Ny",2002,TRUE,55000.0,66000.0,3.5,New York,NY
2,35.0,4.2,"New York,Ny",-1,-1,77000.0,89000.0,4.0,New York,NY
3,64.0,4.4,India In,1988,-1,44000.0,99000.0,4.5,India,IN
4,25.0,6.4,Australia Aus,2002,-1,44000.0,99000.0,6.5,Australia,AUS
5,44.0,1.4,"India,In",1999,TRUE,77000.0,89000.0,1.5,India,IN
6,21.0,0.0,"New York,Ny",-1,-1,44000.0,99000.0,0.0,New York,NY
7,44.0,4.2,Australia Aus,-1,-1,44000.0,99000.0,4.0,Australia,AUS
8,35.0,5.4,"New York,Ny",-1,-1,44000.0,99000.0,5.5,New York,NY
9,22.0,7.7,"India,In",-1,TRUE,44000.0,99000.0,7.5,India,IN


In [140]:
#deleting the original Location column, which is no longer needed

del df4_v2['Location']
df4_v2.head(3)

Unnamed: 0,Age,Rating,Established,Easy Apply,Lower_Salary_Range,Upper_Salary_Range,Rounded_Rating,Region,Region_Code
0,44.0,5.4,1999,TRUE,44000.0,99000.0,5.5,India,IN
1,66.0,3.5,2002,TRUE,55000.0,66000.0,3.5,New York,NY
2,35.0,4.2,-1,-1,77000.0,89000.0,4.0,New York,NY


In [146]:
#Adding labels (0, 1, 2) for Regions

region_num = {'IN': 0, 'NY': 1, 'AUS': 2}

df4_v2['Region_Number'] = df4_v2['Region_Code'].map(region_num)

df4_v2.head()

Unnamed: 0,Age,Rating,Established,Easy Apply,Lower_Salary_Range,Upper_Salary_Range,Rounded_Rating,Region,Region_Code,Region_Number
0,44.0,5.4,1999,TRUE,44000.0,99000.0,5.5,India,IN,0
1,66.0,3.5,2002,TRUE,55000.0,66000.0,3.5,New York,NY,1
2,35.0,4.2,-1,-1,77000.0,89000.0,4.0,New York,NY,1
3,64.0,4.4,1988,-1,44000.0,99000.0,4.5,India,IN,0
4,25.0,6.4,2002,-1,44000.0,99000.0,6.5,Australia,AUS,2
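
When the set of regions is not known in advance, the hand-written dictionary can be replaced by labels generated from the data. A minimal sketch using `pd.factorize` on illustrative codes:

```python
import pandas as pd

codes = pd.Series(["IN", "NY", "NY", "IN", "AUS"])

# factorize numbers the distinct values in order of first appearance
region_number, region_labels = pd.factorize(codes)

print(region_number.tolist())
print(list(region_labels))
```

Because the numbering follows order of appearance, the labels depend on row order; an explicit dictionary, as used above, keeps them stable across datasets.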


## Step 6: Established Column

The Established column has negative values that need to be tackled.

In [198]:
df6 = df4_v2.copy()

In [199]:
#rows where the value is -1 in the established column

df6[df6['Established'] == -1]

Unnamed: 0,Age,Rating,Established,Easy Apply,Lower_Salary_Range,Upper_Salary_Range,Rounded_Rating,Region,Region_Code,Region_Number
2,35.0,4.2,-1,-1,77000.0,89000.0,4.0,New York,NY,1
6,21.0,0.0,-1,-1,44000.0,99000.0,0.0,New York,NY,1
7,44.0,4.2,-1,-1,44000.0,99000.0,4.0,Australia,AUS,2
8,35.0,5.4,-1,-1,44000.0,99000.0,5.5,New York,NY,1
9,22.0,7.7,-1,TRUE,44000.0,99000.0,7.5,India,IN,0


Here, just as we used a criterion to fill the null values in the Age column, we can choose a column on which to base the fill values for the Established column.

In [200]:
# replacing -1 with null values so they do not distort the measures of central tendency
# (mask() sets the matching cells to NaN and keeps the column numeric)

df6['Established'] = df6['Established'].mask(df6['Established'] == -1)

df6.head(7)

Unnamed: 0,Age,Rating,Established,Easy Apply,Lower_Salary_Range,Upper_Salary_Range,Rounded_Rating,Region,Region_Code,Region_Number
0,44.0,5.4,1999.0,TRUE,44000.0,99000.0,5.5,India,IN,0
1,66.0,3.5,2002.0,TRUE,55000.0,66000.0,3.5,New York,NY,1
2,35.0,4.2,,-1,77000.0,89000.0,4.0,New York,NY,1
3,64.0,4.4,1988.0,-1,44000.0,99000.0,4.5,India,IN,0
4,25.0,6.4,2002.0,-1,44000.0,99000.0,6.5,Australia,AUS,2
5,44.0,1.4,1999.0,TRUE,77000.0,89000.0,1.5,India,IN,0
6,21.0,0.0,,-1,44000.0,99000.0,0.0,New York,NY,1


In [202]:
# Finding the measures of central tendency where the Lower Salary range is 44k

filter_40 = df6[df6['Lower_Salary_Range'] == 44000]

print("Median of 'Established' where lower salary range is 44000:", filter_40['Established'].median())
print("Mean of 'Established' where lower salary range is 44000:", filter_40['Established'].mean())
print("Mode of 'Established' where lower salary range is 44000:", filter_40['Established'].mode().values[0])

Median of 'Established' where lower salary range is 44000: 1987.0
Mean of 'Established' where lower salary range is 44000: 1978.0
Mode of 'Established' where lower salary range is 44000: 1999


In [203]:
# Finding the measures of central tendency where the Lower Salary range is 77k

filter_70 = df6[df6['Lower_Salary_Range'] == 77000]

print("Median of 'Established' where lower salary range is 77000:", filter_70['Established'].median())
print("Mean of 'Established' where lower salary range is 77000:", filter_70['Established'].mean())
print("Mode of 'Established' where lower salary range is 77000:", filter_70['Established'].mode().values[0])

Median of 'Established' where lower salary range is 77000: 1999.0
Mean of 'Established' where lower salary range is 77000: 1999.0
Mode of 'Established' where lower salary range is 77000: 1999


In [204]:
print('est min:', df6['Established'].agg('min'))
print('est max:', df6['Established'].agg('max'))

est min: 1932
est max: 2020


In [205]:
#Filling the null established values where lower salary =44k with median

df6.loc[df6['Lower_Salary_Range'] == 44000, 'Established'] = df6.loc[df6['Lower_Salary_Range'] == 44000, 'Established'].fillna(filter_40['Established'].median())

In [206]:
#Filling the null established values where lower salary =77k with median

df6.loc[df6['Lower_Salary_Range'] == 77000, 'Established'] = df6.loc[df6['Lower_Salary_Range'] == 77000, 'Established'].fillna(filter_70['Established'].median())
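
The per-group fills above can be generalised so that every salary group is handled in one step. The sketch below runs on made-up data (the `firms` frame is an assumption) and fills each gap with the median of its own `Lower_Salary_Range` group.

```python
import pandas as pd

# Illustrative data: two salary groups, one -1 placeholder in each
firms = pd.DataFrame({
    "Lower_Salary_Range": [44000.0] * 4 + [77000.0] * 4,
    "Established": [1999, -1, 1988, 1987, 2002, -1, 1999, 1999],
})

# Turn placeholders into NaN so they are ignored by median()
established = firms["Established"].mask(firms["Established"] == -1)

# Median of each salary group, aligned row by row with the original column
group_median = established.groupby(firms["Lower_Salary_Range"]).transform("median")

firms["Established"] = established.fillna(group_median).astype(int)

print(firms)
```

A group whose values are all missing would have a NaN median, and the final `astype(int)` would then fail, so such groups need a fallback.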

In [208]:
df6.head()

Unnamed: 0,Age,Rating,Established,Easy Apply,Lower_Salary_Range,Upper_Salary_Range,Rounded_Rating,Region,Region_Code,Region_Number
0,44.0,5.4,1999.0,TRUE,44000.0,99000.0,5.5,India,IN,0
1,66.0,3.5,2002.0,TRUE,55000.0,66000.0,3.5,New York,NY,1
2,35.0,4.2,1999.0,-1,77000.0,89000.0,4.0,New York,NY,1
3,64.0,4.4,1988.0,-1,44000.0,99000.0,4.5,India,IN,0
4,25.0,6.4,2002.0,-1,44000.0,99000.0,6.5,Australia,AUS,2


In [213]:
#Converting the Established column from float to int

df6['Established'] = df6['Established'].astype(int)
df6.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Age                 29 non-null     int32  
 1   Rating              29 non-null     float64
 2   Established         29 non-null     int32  
 3   Easy Apply          29 non-null     object 
 4   Lower_Salary_Range  29 non-null     float64
 5   Upper_Salary_Range  29 non-null     float64
 6   Rounded_Rating      29 non-null     float64
 7   Region              29 non-null     object 
 8   Region_Code         29 non-null     object 
 9   Region_Number       29 non-null     int64  
dtypes: float64(4), int32(2), int64(1), object(3)
memory usage: 2.2+ KB


## Step 7: Easy Apply

The Easy Apply column has `-1` placeholder values and needs to be converted to the bool data type.

In [215]:
df7 = df6.copy()

In [216]:
#replacing -1 with False
#Since every other cell is TRUE, we assume that the -1 cells stand for False

df7['Easy Apply'] = df7['Easy Apply'].replace('-1', False)
df7["Easy Apply"].info()

<class 'pandas.core.series.Series'>
RangeIndex: 29 entries, 0 to 28
Series name: Easy Apply
Non-Null Count  Dtype 
--------------  ----- 
29 non-null     object
dtypes: object(1)
memory usage: 360.0+ bytes


In [217]:
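# converting the string 'TRUE' to the boolean True
# (the explicit astype(bool) does not rely on replace() downcasting the column, which newer pandas versions deprecate)

df7['Easy Apply'] = df7['Easy Apply'].replace('TRUE', True).astype(bool)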
df7['Easy Apply'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 29 entries, 0 to 28
Series name: Easy Apply
Non-Null Count  Dtype
--------------  -----
29 non-null     bool 
dtypes: bool(1)
memory usage: 157.0 bytes
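
The two `replace` calls can also be expressed as a single explicit mapping. The sketch below assumes the column contains only the strings `'TRUE'` and `'-1'`, and raises if anything else turns up.

```python
import pandas as pd

easy_apply = pd.Series(["TRUE", "-1", "-1", "TRUE"])

# Explicit mapping; treating -1 as False is the same assumption made above
mapping = {"TRUE": True, "-1": False}
mapped = easy_apply.map(mapping)

# Values missing from the mapping become NaN, so check before casting
unmapped = easy_apply[mapped.isna()]
if not unmapped.empty:
    raise ValueError(f"unexpected Easy Apply values: {unmapped.unique()}")

easy_apply_bool = mapped.astype(bool)

print(easy_apply_bool.tolist())
```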


In [218]:
df7

Unnamed: 0,Age,Rating,Established,Easy Apply,Lower_Salary_Range,Upper_Salary_Range,Rounded_Rating,Region,Region_Code,Region_Number
0,44,5.4,1999,True,44000.0,99000.0,5.5,India,IN,0
1,66,3.5,2002,True,55000.0,66000.0,3.5,New York,NY,1
2,35,4.2,1999,False,77000.0,89000.0,4.0,New York,NY,1
3,64,4.4,1988,False,44000.0,99000.0,4.5,India,IN,0
4,25,6.4,2002,False,44000.0,99000.0,6.5,Australia,AUS,2
5,44,1.4,1999,True,77000.0,89000.0,1.5,India,IN,0
6,21,0.0,1987,False,44000.0,99000.0,0.0,New York,NY,1
7,44,4.2,1987,False,44000.0,99000.0,4.0,Australia,AUS,2
8,35,5.4,1987,False,44000.0,99000.0,5.5,New York,NY,1
9,22,7.7,1987,True,44000.0,99000.0,7.5,India,IN,0


## Step 8: Exporting the cleaned dataset


In [221]:
clean_csv = df7.copy()

In [222]:
#index=False saves the csv file without the index numbers
clean_csv.to_csv('cleaned_data.csv', index=False)
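
A quick way to confirm that the export behaves as expected is to read it back and compare. The sketch below uses an in-memory buffer and a made-up frame (`cleaned`) so that it does not touch any file on disk.

```python
import io

import pandas as pd

# Illustrative stand-in for the cleaned data
cleaned = pd.DataFrame({
    "Age": [44, 66],
    "Easy Apply": [True, False],
    "Region": ["India", "New York"],
})

# Write to an in-memory buffer instead of a file, without the index
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# Read it back and check that no extra index column appeared
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded.dtypes)
```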