# CSV to ARFF

In [1]:
import pandas as pd
df = pd.read_csv('adult.csv')
df

Unnamed: 0,age,workclass,education,marital-status,occupation,race,gender,hours-per-week,income
0,20-30,Private,11th,Never-married,Machine-op-inspct,Black,0,20-40,<=50K
1,30-40,Private,HS-grad,Married-civ-spouse,Farming-fishing,White,0,40-60,<=50K
2,20-30,Local-gov,Assoc-acdm,Married-civ-spouse,Protective-serv,White,0,20-40,>50K
3,40-50,Private,Some-college,Married-civ-spouse,Machine-op-inspct,Black,0,20-40,>50K
4,30-40,Private,10th,Never-married,Other-service,White,0,20-40,<=50K
...,...,...,...,...,...,...,...,...,...
46028,20-30,Private,Assoc-acdm,Married-civ-spouse,Tech-support,White,1,20-40,<=50K
46029,30-40,Private,HS-grad,Married-civ-spouse,Machine-op-inspct,White,0,20-40,>50K
46030,50-60,Private,HS-grad,Widowed,Adm-clerical,White,1,20-40,<=50K
46031,20-30,Private,HS-grad,Never-married,Adm-clerical,White,0,0-20,<=50K


### Get the column titles
There are two equivalent ways to get the column titles:
```python
list(df.columns.values)
list(df)
```
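As a quick check that the two forms agree, here is a minimal sketch on a small made-up frame (the name `demo` and its two rows are illustrative, not the real dataset):

```python
import pandas as pd

# Tiny illustrative frame; the notebook itself loads adult.csv instead.
demo = pd.DataFrame({"age": ["20-30", "30-40"], "income": ["<=50K", ">50K"]})

titles_a = list(demo.columns.values)  # labels taken from the underlying NumPy array
titles_b = list(demo)                 # iterating a DataFrame yields its column labels

print(titles_a == titles_b)
```

`list(df)` is the shorter form, which is why the next cell uses it.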

In [2]:
attrs = list(df)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46033 entries, 0 to 46032
Data columns (total 9 columns):
age               46033 non-null object
workclass         46033 non-null object
education         46033 non-null object
marital-status    46033 non-null object
occupation        46033 non-null object
race              46033 non-null object
gender            46033 non-null int64
hours-per-week    46033 non-null object
income            46033 non-null object
dtypes: int64(1), object(8)
memory usage: 3.2+ MB


### Get each column's unique values

In [3]:
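# For each column, collect its distinct values in sorted order.
# attr_vals[i] holds the values of the column named attrs[i].
attr_vals = []
for col in attrs:
    attr_vals.append(sorted(df[col].unique().tolist()))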

In [4]:
attr_vals[0]

['0-20', '20-30', '30-40', '40-50', '50-60', '60-70', '70-80', '80-90']
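The same information can also be keyed by column name instead of by position, so the names and values cannot drift out of step. This is a sketch on a made-up frame; `demo` and `unique_by_column` are hypothetical names, and it assumes each column holds a single comparable type with no missing values (otherwise `sorted` may raise a `TypeError`).

```python
import pandas as pd

# Made-up rows standing in for the real data.
demo = pd.DataFrame({
    "age": ["30-40", "20-30", "30-40"],
    "gender": [1, 0, 1],
})

# Map each column label to its sorted distinct values.
unique_by_column = {col: sorted(demo[col].unique().tolist()) for col in demo}

print(unique_by_column)
# {'age': ['20-30', '30-40'], 'gender': [0, 1]}
```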

### Create an .arff file and write to it
Take a look at the format of an .arff file. It is made up of a **file name** (the `@relation` line), several **attributes**, and the **data**.
```js
@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
```
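To make the three parts concrete, the following sketch splits a shortened version of the sample above into its relation name, attributes, and data rows. The function `split_arff` is a hypothetical helper, not part of any library, and it only handles unquoted nominal attributes like the ones shown here.

```python
def split_arff(text):
    """Split ARFF text into (relation, attributes, rows).

    Simplified: handles only unquoted nominal attributes.
    """
    relation, attributes, rows = None, {}, []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # blank line or ARFF comment
            continue
        if in_data:
            rows.append(line.split(","))
            continue
        keyword = line.split(None, 1)[0].lower()  # declarations are case-insensitive
        if keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, spec = line.split(None, 2)
            attributes[name] = [v.strip() for v in spec.strip("{}").split(",")]
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


sample = """@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute windy {TRUE, FALSE}

@data
sunny,FALSE
overcast,TRUE
"""

relation, attributes, rows = split_arff(sample)
print(relation)
print(attributes)
print(rows)
```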

First, we build the header in a string.

**1. File name**

In [5]:
header = ""
header += "@relation adult"

**2. Attributes**

In [6]:
header += "\n\n"
for name, values in zip(attrs, attr_vals):
    # Join the values with ", " so the last one needs no special case.
    nominal = ", ".join(str(v) for v in values)
    header += f"@attribute {name} {{{nominal}}}\n"

In [7]:
print(header)

@relation adult

@attribute age {0-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80, 80-90}
@attribute workclass {Federal-gov, Local-gov, Private, Self-emp-inc, Self-emp-not-inc, State-gov, Without-pay}
@attribute education {10th, 11th, 12th, 1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, Bachelors, Doctorate, HS-grad, Masters, Preschool, Prof-school, Some-college}
@attribute marital-status {Divorced, Married-AF-spouse, Married-civ-spouse, Married-spouse-absent, Never-married, Separated, Widowed}
@attribute occupation {Adm-clerical, Armed-Forces, Craft-repair, Exec-managerial, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Other-service, Priv-house-serv, Prof-specialty, Protective-serv, Sales, Tech-support, Transport-moving}
@attribute race {Amer-Indian-Eskimo, Asian-Pac-Islander, Black, Other, White}
@attribute gender {0, 1}
@attribute hours-per-week {0-20, 20-40, 40-60, 60-80, 80-100}
@attribute income {<=50K, >50K}
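The header above works as-is because none of the names or values in this dataset contain spaces or commas. ARFF expects such entries to be quoted, so a more defensive variant might look like the sketch below. `quote_if_needed` and `build_header` are hypothetical helpers, the `native country` column is made up for illustration, and the quoting rule is simplified (values that themselves contain quote characters are not handled).

```python
def quote_if_needed(value):
    """Wrap a value in single quotes if it contains a space, comma, brace, or percent sign."""
    text = str(value)
    if any(ch in text for ch in " ,{}%"):
        return f"'{text}'"
    return text


def build_header(relation, names, values_per_column):
    """Build the @relation and @attribute lines for nominal columns."""
    lines = [f"@relation {quote_if_needed(relation)}", ""]
    for name, values in zip(names, values_per_column):
        nominal = ", ".join(quote_if_needed(v) for v in values)
        lines.append(f"@attribute {quote_if_needed(name)} {{{nominal}}}")
    return "\n".join(lines) + "\n"


demo_header = build_header(
    "adult",
    ["gender", "native country"],            # second name is a made-up example
    [[0, 1], ["United States", "Peru"]],
)
print(demo_header)
```

If quoting is applied in the header, the same quoting would have to be applied to the data rows as well, which the plain line-by-line copy in the next step does not do.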



**3. Write the .arff file with header and data body**

In [8]:
with open('adult.csv') as in_file, open("adult.arff", 'w') as out_file:
    # Header
    out_file.write(header)
    out_file.write('\n@data\n')

    # Body: skip the CSV title row, then copy the remaining lines unchanged
    in_file.readline()
    for line in in_file:
        out_file.write(line)

    # Alternative: out_file.write(in_file.read()) copies the rest of the file
    # in a single call, at the cost of loading it all into memory.
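To try the copy step without touching the disk, the sketch below uses `io.StringIO` objects in place of both files and then checks which rows ended up after `@data`. The CSV text and the two-attribute header are made up for illustration; they stand in for `adult.csv` and the header built earlier.

```python
import io

# Made-up stand-ins for adult.csv and the header string.
csv_text = "age,income\n20-30,<=50K\n30-40,>50K\n"
header = "@relation adult\n\n@attribute age {20-30, 30-40}\n@attribute income {<=50K, >50K}\n"

in_file = io.StringIO(csv_text)
out_file = io.StringIO()

out_file.write(header)
out_file.write("\n@data\n")
in_file.readline()        # skip the CSV title row
for line in in_file:      # file-like objects yield one line at a time
    out_file.write(line)

# Everything after the @data marker should be the CSV body, unchanged.
arff_text = out_file.getvalue()
data_rows = arff_text.split("@data\n", 1)[1].splitlines()
print(data_rows)
# ['20-30,<=50K', '30-40,>50K']
```

Because the body is copied verbatim, any extra column in the CSV (for example a saved index column) would also be copied and would no longer match the declared attributes.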