#### What are you trying to do in this notebook?
My task is to classify 10 bacteria species using data from a genomic analysis technique that involves some compression and data loss. In this technique, 10-mer snippets of DNA are sampled and analyzed to give a histogram of base counts.

#### Why are you trying it?
The DNA segment ATATGGCCTT becomes A2T4G2C2. Can you use this lossy information to accurately predict bacteria species?

From this data we need to recover a genome fingerprint that identifies the bacterium. In other words, we classify each sample into one of 10 bacteria species from compressed genome sequencing data, in which only the base counts of each snippet survive and the order of the bases is lost.
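As a small illustration of the compression described above, the sketch below counts the bases in a snippet and formats them in A, T, G, C order. The function name `compress_snippet` is made up for this example and is not part of the competition data or any library.

```python
from collections import Counter

def compress_snippet(snippet):
    """Return the base-count label for a DNA snippet, e.g. 'A2T4G2C2'."""
    counts = Counter(snippet.upper())
    # Fixed A, T, G, C order, matching the column names in the data.
    return "".join(f"{base}{counts.get(base, 0)}" for base in "ATGC")

label = compress_snippet("ATATGGCCTT")
# A different ordering of the same bases gives the same label,
# which is exactly the information that the compression loses.
shuffled = compress_snippet("TTTTAAGGCC")

print(label)              # A2T4G2C2
print(label == shuffled)  # True
```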

For this challenge, we will be predicting bacteria species based on repeated lossy measurements of DNA snippets. Snippets of length 10 are analyzed using Raman spectroscopy that calculates the histogram of bases in the snippet.

Each row of data contains a spectrum of histograms generated by repeated measurements of a sample. A row holds the output for all 286 possible histograms (from A0T0G0C10 to A10T0G0C0), and a bias spectrum (that of totally random ATGC) is then subtracted from the results.
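The count of 286 can be checked by enumerating every way to split 10 bases among A, T, G and C. The sketch below also computes the probability of each histogram under uniformly random bases, on the assumption that this is what "bias spectrum of totally random ATGC" means. The helper names `histogram_labels` and `random_bias` are hypothetical.

```python
from math import factorial

K = 10  # snippet length

def histogram_labels(k=K):
    """List every base-count label for snippets of length k."""
    labels = []
    for a in range(k + 1):
        for t in range(k + 1 - a):
            for g in range(k + 1 - a - t):
                c = k - a - t - g
                labels.append(f"A{a}T{t}G{g}C{c}")
    return labels

def random_bias(a, t, g, c):
    """Multinomial probability of a histogram when each base has probability 1/4."""
    k = a + t + g + c
    ways = factorial(k) // (factorial(a) * factorial(t) * factorial(g) * factorial(c))
    return ways / 4 ** k

labels = histogram_labels()
print(len(labels))                # 286
print(labels[0], labels[-1])      # A0T0G0C10 A10T0G0C0
print(random_bias(0, 0, 0, 10))   # 9.5367431640625e-07
```

Under this assumption, a histogram bin that was never observed would hold zero minus its bias, for example about -9.54e-07 for A0T0G0C10.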

The data (both train and test) also contains simulated measurement errors (of varying rates) for many of the samples, which makes the problem more challenging.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/tps-feb-22-lightautoml-pseudolabel/__results__.html
/kaggle/input/tps-feb-22-lightautoml-pseudolabel/lightautoml_tabularautoml.csv
/kaggle/input/tps-feb-22-lightautoml-pseudolabel/__resultx__.html
/kaggle/input/tps-feb-22-lightautoml-pseudolabel/notebook.css
/kaggle/input/tps-feb-22-lightautoml-pseudolabel/__notebook__.ipynb
/kaggle/input/tps-feb-22-lightautoml-pseudolabel/__output__.json
/kaggle/input/tps-feb-22-lightautoml-pseudolabel/custom.css
/kaggle/input/forest-of-extra-trees-0-9895-up-to-4th-place/submission__blend.csv
/kaggle/input/forest-of-extra-trees-0-9895-up-to-4th-place/Super_Model.png
/kaggle/input/forest-of-extra-trees-0-9895-up-to-4th-place/__results__.html
/kaggle/input/forest-of-extra-trees-0-9895-up-to-4th-place/submission__blend_2.csv
/kaggle/input/forest-of-extra-trees-0-9895-up-to-4th-place/__notebook_source__.ipynb
/kaggle/input/forest-of-extra-trees-0-9895-up-to-4th-place/submission__.csv
/kaggle/input/forest-of-extra-trees-0-9895-up-to-4th-place

In [2]:
train = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2022/train.csv')
test = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2022/test.csv')
sub = pd.read_csv('../input/early-ensemble/submission.csv')

In [3]:
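# Feature columns: every test column except the identifier.
cols = [c for c in test.columns if c != 'row_id']
# Keep only the first training row for each distinct feature vector.
train.drop_duplicates(subset=cols, keep='first', inplace=True)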

In [4]:
print(train.shape)

(123993, 288)
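To make the deduplication step concrete, here is a toy sketch with made-up data. The frame `toy_train` and its columns `f1` and `f2` are invented for illustration; only the `drop_duplicates` call mirrors the cell above.

```python
import pandas as pd

# Made-up data: rows 0, 1 and 3 share the same feature vector.
toy_train = pd.DataFrame({
    "row_id": [0, 1, 2, 3],
    "f1": [0.1, 0.1, 0.2, 0.1],
    "f2": [0.5, 0.5, 0.7, 0.5],
    "target": ["a", "a", "b", "a"],
})
feature_cols = [c for c in toy_train.columns if c not in ("row_id", "target")]

# How many times each distinct feature vector occurs.
dup_counts = toy_train.groupby(feature_cols).size()

# Keep the first row of each group of identical feature vectors.
deduped = toy_train.drop_duplicates(subset=feature_cols, keep="first")
print(deduped.shape)  # (2, 4)
```

Only the feature columns decide what counts as a duplicate, so rows with different `row_id` values still collapse into one.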


In [5]:
s1 = pd.merge(train, test, how='inner', on=cols)

s1

Unnamed: 0,row_id_x,A0T0G0C10,A0T0G1C9,A0T0G2C8,A0T0G3C7,A0T0G4C6,A0T0G5C5,A0T0G6C4,A0T0G7C3,A0T0G8C2,...,A8T0G2C0,A8T1G0C1,A8T1G1C0,A8T2G0C0,A9T0G0C1,A9T0G1C0,A9T1G0C0,A10T0G0C0,target,row_id_y
0,12,-9.536743e-07,-0.00001,-0.000043,-0.000114,-0.0002,-0.00024,-0.0002,-0.000114,-0.000043,...,-0.000043,-0.000086,-0.000086,-0.000043,-0.00001,-0.00001,-0.00001,-9.536743e-07,Escherichia_fergusonii,262823
1,360,-9.536743e-07,-0.00001,-0.000043,-0.000114,-0.0002,-0.00024,-0.0002,-0.000114,-0.000043,...,-0.000043,-0.000086,-0.000086,-0.000043,-0.00001,-0.00001,-0.00001,-9.536743e-07,Salmonella_enterica,293245
2,1309,-9.536743e-07,-0.00001,-0.000043,-0.000114,-0.0002,-0.00024,-0.0002,-0.000114,-0.000043,...,-0.000043,-0.000086,-0.000086,-0.000043,-0.00001,-0.00001,-0.00001,-9.536743e-07,Escherichia_fergusonii,275038
3,1865,-9.536743e-07,-0.00001,-0.000043,-0.000114,-0.0002,-0.00024,-0.0002,-0.000114,-0.000043,...,-0.000043,-0.000086,-0.000086,-0.000043,-0.00001,-0.00001,-0.00001,-9.536743e-07,Escherichia_fergusonii,260922
4,2760,-9.536743e-07,-0.00001,-0.000043,-0.000114,-0.0002,-0.00024,-0.0002,-0.000114,-0.000043,...,-0.000043,-0.000086,-0.000086,-0.000043,-0.00001,-0.00001,-0.00001,-9.536743e-07,Salmonella_enterica,229249
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
481,191663,-9.536743e-07,-0.00001,-0.000043,-0.000114,-0.0002,-0.00024,-0.0002,-0.000114,-0.000043,...,-0.000043,-0.000086,-0.000086,-0.000043,-0.00001,-0.00001,-0.00001,-9.536743e-07,Streptococcus_pneumoniae,260248
482,197003,-9.536743e-07,-0.00001,-0.000043,-0.000114,-0.0002,-0.00024,-0.0002,-0.000114,-0.000043,...,-0.000043,-0.000086,-0.000086,-0.000043,-0.00001,-0.00001,-0.00001,-9.536743e-07,Streptococcus_pneumoniae,252282
483,197003,-9.536743e-07,-0.00001,-0.000043,-0.000114,-0.0002,-0.00024,-0.0002,-0.000114,-0.000043,...,-0.000043,-0.000086,-0.000086,-0.000043,-0.00001,-0.00001,-0.00001,-9.536743e-07,Streptococcus_pneumoniae,252660
484,197003,-9.536743e-07,-0.00001,-0.000043,-0.000114,-0.0002,-0.00024,-0.0002,-0.000114,-0.000043,...,-0.000043,-0.000086,-0.000086,-0.000043,-0.00001,-0.00001,-0.00001,-9.536743e-07,Streptococcus_pneumoniae,279979


In [6]:
s1.row_id_y.nunique()

486
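The inner merge on all feature columns keeps only train/test pairs whose features are identical. The toy sketch below uses made-up frames to show where the `row_id_x` (train) and `row_id_y` (test) suffixes come from, and that one training row can match several test rows.

```python
import pandas as pd

# Made-up data: test rows 100 and 102 both equal training row 10.
toy_train = pd.DataFrame({
    "row_id": [10, 11],
    "f1": [0.1, 0.2],
    "f2": [0.5, 0.7],
    "target": ["a", "b"],
})
toy_test = pd.DataFrame({
    "row_id": [100, 101, 102],
    "f1": [0.1, 0.9, 0.1],
    "f2": [0.5, 0.3, 0.5],
})
feature_cols = ["f1", "f2"]

# row_id exists in both frames but is not a join key, so pandas
# renames it to row_id_x (left/train) and row_id_y (right/test).
matches = pd.merge(toy_train, toy_test, how="inner", on=feature_cols)
print(sorted(matches.columns))
print(matches["row_id_y"].nunique())  # 2
```

Because the join keys are floats, only bit-identical values match. Rows that differ by tiny rounding would not be paired by this approach.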

In [7]:
# Map each matched test row_id to a training row_id (for repeated keys the last match wins).
dic = dict(zip(s1['row_id_y'].tolist(), s1['row_id_x'].tolist()))

In [8]:
len(dic)

486

In [9]:
for e in dic.items():
    print(e)

(262823, 12)
(293245, 360)
(275038, 1309)
(260922, 1865)
(229249, 2760)
(215626, 4158)
(224579, 4158)
(219742, 4355)
(279671, 4355)
(283531, 4355)
(292980, 4815)
(224689, 4934)
(217226, 5002)
(237965, 5002)
(284900, 5002)
(260875, 5765)
(280184, 5765)
(239258, 6020)
(251897, 6020)
(205077, 6085)
(267928, 6085)
(227381, 6849)
(223584, 6865)
(225575, 6865)
(296917, 6865)
(262055, 7025)
(208484, 7261)
(239353, 7688)
(287248, 7688)
(209825, 7887)
(274476, 7887)
(283789, 7887)
(214549, 7967)
(229583, 7967)
(210061, 8170)
(226481, 8281)
(270865, 8281)
(279772, 8281)
(207640, 8352)
(212675, 8352)
(240629, 8352)
(270697, 8381)
(299688, 8381)
(242948, 8502)
(274485, 8502)
(290336, 8502)
(285729, 8581)
(215254, 8678)
(255373, 8678)
(256553, 8678)
(261347, 8678)
(236109, 8717)
(247071, 8717)
(271614, 9976)
(200882, 9980)
(221964, 9980)
(242648, 10565)
(202955, 10626)
(249618, 10626)
(224357, 10663)
(290849, 10663)
(203168, 10666)
(257816, 11074)
(234821, 11243)
(293700, 11243)
(214412, 11680)
(21

In [10]:
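# Overwrite the prediction for each matched test row with its training row's label.
train_target = train.set_index('row_id')['target']
for test_id, train_id in dic.items():
    sub.loc[sub['row_id'] == test_id, 'target'] = train_target.loc[train_id]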
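The same override can be written without a Python-level loop over the submission. This is one possible vectorized sketch on made-up data, not the notebook's own method; `toy_sub`, `toy_train` and `test_to_train` stand in for `sub`, `train` and `dic`.

```python
import pandas as pd

# Made-up stand-ins for train, sub and dic.
toy_train = pd.DataFrame({"row_id": [10, 11], "target": ["a", "b"]})
toy_sub = pd.DataFrame({"row_id": [100, 101, 102], "target": ["b", "b", "b"]})
test_to_train = {100: 10, 102: 10}

# Look up the known label for every matched test row.
train_target = toy_train.set_index("row_id")["target"]
test_to_label = {
    test_id: train_target.loc[train_id]
    for test_id, train_id in test_to_train.items()
}

# map() gives NaN for unmatched rows; fillna() keeps the original prediction there.
override = toy_sub["row_id"].map(test_to_label)
toy_sub["target"] = override.fillna(toy_sub["target"])
print(toy_sub["target"].tolist())  # ['a', 'b', 'a']
```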

In [11]:
sub.to_csv("submission.csv", index=False)

In [12]:
sub

Unnamed: 0,row_id,target
0,200000,Escherichia_fergusonii
1,200001,Salmonella_enterica
2,200002,Enterococcus_hirae
3,200003,Salmonella_enterica
4,200004,Staphylococcus_aureus
...,...,...
99995,299995,Streptococcus_pneumoniae
99996,299996,Bacteroides_fragilis
99997,299997,Bacteroides_fragilis
99998,299998,Bacteroides_fragilis


#### Did it work?
The target column is the target variable, which takes one of 10 bacteria species: Streptococcus_pyogenes, Salmonella_enterica, Enterococcus_hirae, Escherichia_coli, Campylobacter_jejuni, Streptococcus_pneumoniae, Staphylococcus_aureus, Escherichia_fergusonii, Bacteroides_fragilis, and Klebsiella_pneumoniae.

The train dataset has 200,000 rows and 288 columns: 286 features, 1 target column (target), and 1 row_id column.

The test dataset has 100,000 rows and 287 columns: 286 features and 1 row_id column. There are no missing values in the train or test dataset.
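Figures like these can be checked with a few pandas calls. The sketch below runs on a made-up frame; the helper name `summarize` is hypothetical.

```python
import pandas as pd

def summarize(df, target_col=None):
    """Return row/column counts, total missing cells and, optionally, the class count."""
    summary = {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "missing": int(df.isna().sum().sum()),
    }
    if target_col is not None and target_col in df.columns:
        summary["classes"] = int(df[target_col].nunique())
    return summary

# Made-up frame standing in for the training data.
toy = pd.DataFrame({
    "row_id": [0, 1, 2],
    "f1": [0.1, 0.2, 0.3],
    "target": ["a", "b", "a"],
})
toy_summary = summarize(toy, target_col="target")
print(toy_summary)
```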

#### What did you not understand about this process?
Everything is explained on the competition data page, and I had no problems while working on it. If anything I do in this notebook is unclear, please leave a comment.

#### What else do you think you can try as part of this approach?
Classify 10 different bacteria species using data from a genomic analysis technique that involves some compression and data loss. In this technique, 10-mer snippets of DNA are sampled and analyzed to give a histogram of base counts.