HDFS block corruption when running a new container #60

Closed
oguennec opened this issue Jun 9, 2015 · 1 comment

oguennec commented Jun 9, 2015

I consistently get HDFS / HBase block corruption when I run a new container from an image committed from a healthy single-node HDP cluster.

Steps followed:

  • Created an HDP cluster using the sequenceiq/ambari Dockerfile. The HDFS filesystem was healthy.
  • Stopped all services in the Ambari web UI.
  • Ran docker commit on the container.
  • Ran a new container from the resulting image with docker run.
  • Restarted all services in Ambari.
  • HDFS consistently reports corrupt and missing blocks.
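
The Docker part of the steps above can be sketched roughly as follows. This is a minimal dry-run sketch, not the exact commands that were used: `initial_container` and `hdp-snapshot` are placeholder names, and the `DOCKER` variable is an assumption introduced here so that the commands are only printed unless it is set to `docker`.

```shell
#!/bin/sh
# Dry-run by default: DOCKER expands to "echo docker", so each command is
# printed instead of executed. Set DOCKER=docker to run them for real.
DOCKER="${DOCKER:-echo docker}"

# Placeholder names (assumptions for illustration only).
SOURCE_CONTAINER="${SOURCE_CONTAINER:-initial_container}"
SNAPSHOT_IMAGE="${SNAPSHOT_IMAGE:-hdp-snapshot}"

snapshot_and_rerun() {
  # Stop all services in the Ambari web UI before this point.
  # Save the container's filesystem as a new image.
  $DOCKER commit "$SOURCE_CONTAINER" "$SNAPSHOT_IMAGE"
  # Start a fresh container from that image; add whatever options the
  # original container was started with.
  $DOCKER run "$SNAPSHOT_IMAGE"
  # Afterwards: restart all services in Ambari and check HDFS with fsck.
}

snapshot_and_rerun
```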

Example of the corruption:
-bash-4.1# HADOOP_USER_NAME=hdfs hdfs fsck /
Connecting to namenode via http://og.mycorp.com:50070
FSCK started by hdfs (auth:SIMPLE) from /172.17.0.2 for path / at Mon Jun 08 11:33:06 EDT 2015
.
/app-logs/ambari-qa/logs/application_1433328045348_0001/og.mycorp.com_45454: CORRUPT blockpool BP-108620518-172.17.0.65-1433327686475 block blk_1073741858

/app-logs/ambari-qa/logs/application_1433328045348_0001/og.mycorp.com_45454: MISSING 1 blocks of total size 7080 B..
/app-logs/ambari-qa/logs/application_1433502507205_0001/og.mycorp.com_45454: Under replicated BP-108620518-172.17.0.65-1433327686475:blk_1073741976_1154. Target Replicas is 3 but found 1 replica(s).
.
/app-logs/ambari-qa/logs/application_1433502507205_0002/og.mycorp.com_45454: Under replicated BP-108620518-172.17.0.65-1433327686475:blk_1073741988_1166. Target Replicas is 3 but found 1 replica(s).
.
/apps/hbase/data/data/default/ambarismoketest/.tabledesc/.tableinfo.0000000001: Under replicated BP-108620518-172.17.0.65-1433327686475:blk_1073741841_1017. Target Replicas is 3 but found 1 replica(s).
.
/apps/hbase/data/data/default/ambarismoketest/ac82f75a8636f78f9629dd4b480106d2/.regioninfo: CORRUPT blockpool BP-108620518-172.17.0.65-1433327686475 block blk_1073741842

/apps/hbase/data/data/default/ambarismoketest/ac82f75a8636f78f9629dd4b480106d2/.regioninfo: MISSING 1 blocks of total size 50 B..
/apps/hbase/data/data/default/ambarismoketest/ac82f75a8636f78f9629dd4b480106d2/family/0ade395e2a9b49b8a6ce711d482788d8: Under replicated BP-108620518-172.17.0.65-1433327686475:blk_1073741863_1039. Target Replicas is 3 but found 1 replica(s).
..
/apps/hbase/data/data/hbase/meta/.tabledesc/.tableinfo.0000000001: CORRUPT blockpool BP-108620518-172.17.0.65-1433327686475 block blk_1073741828

/apps/hbase/data/data/hbase/meta/.tabledesc/.tableinfo.0000000001: MISSING 1 blocks of total size 372 B..
/apps/hbase/data/data/hbase/meta/1588230740/.regioninfo: Under replicated BP-108620518-172.17.0.65-1433327686475:blk_1073741827_1003. Target Replicas is 3 but found 1 replica(s).
.
/apps/hbase/data/data/hbase/meta/1588230740/info/8420cae8bce94280995695060a910546: Under replicated BP-108620518-172.17.0.65-1433327686475:blk_1073742145_1325. Target Replicas is 3 but found 1 replica(s).
..
/apps/hbase/data/data/hbase/namespace/.tabledesc/.tableinfo.0000000001: CORRUPT blockpool BP-108620518-172.17.0.65-1433327686475 block blk_1073741834

/apps/hbase/data/data/hbase/namespace/.tabledesc/.tableinfo.0000000001: MISSING 1 blocks of total size 286 B..
/apps/hbase/data/data/hbase/namespace/14115c2297e3486d8f3f4ebf785fd11d/.regioninfo: Under replicated BP-108620518-172.17.0.65-1433327686475:blk_1073741835_1011. Target Replicas is 3 but found 1 replica(s).
.
/apps/hbase/data/data/hbase/namespace/14115c2297e3486d8f3f4ebf785fd11d/info/418efc3186ad4896978913edf793cec4: Under replicated BP-108620518-172.17.0.65-1433327686475:blk_1073741861_1037. Target Replicas is 3 but found 1 replica(s).
..
/apps/hbase/data/hbase.id: CORRUPT blockpool BP-108620518-172.17.0.65-1433327686475 block blk_1073741826

/apps/hbase/data/hbase.id: MISSING 1 blocks of total size 42 B..
/apps/hbase/data/hbase.version: Under replicated BP-108620518-172.17.0.65-1433327686475:blk_1073741825_1001. Target Replicas is 3 but found 1 replica(s).
.
/apps/hbase/data/oldWALs/og.mycorp.com%2C60020%2C1433773404945.1433773424585: CORRUPT blockpool BP-108620518-172.17.0.65-1433327686475 block blk_1073742140

/apps/hbase/data/oldWALs/og.mycorp.com%2C60020%2C1433773404945.1433773424585: MISSING 1 blocks of total size 655 B..
/apps/hbase/data/oldWALs/og.mycorp.com%2C60020%2C1433773404945.1433773750783.meta: CORRUPT blockpool BP-108620518-172.17.0.65-1433327686475 block blk_1073742144

/apps/hbase/data/oldWALs/og.mycorp.com%2C60020%2C1433773404945.1433773750783.meta: MISSING 1 blocks of total size 541 B..
/hdp/apps/2.2.4.2-2/hive/hive.tar.gz: CORRUPT blockpool BP-108620518-172.17.0.65-1433327686475 block blk_1073741989

/hdp/apps/2.2.4.2-2/hive/hive.tar.gz: MISSING 1 blocks of total size 83000677 B..
/hdp/apps/2.2.4.2-2/mapreduce/hadoop-streaming.jar: CORRUPT blockpool BP-108620518-172.17.0.65-1433327686475 block blk_1073741991

/hdp/apps/2.2.4.2-2/mapreduce/hadoop-streaming.jar: MISSING 1 blocks of total size 104996 B..
/hdp/apps/2.2.4.2-2/mapreduce/mapreduce.tar.gz: Under replicated BP-108620518-172.17.0.65-1433327686475:blk_1073741829_1005. Target Replicas is 3 but found 1 replica(s).

/hdp/apps/2.2.4.2-2/mapreduce/mapreduce.tar.gz: CORRUPT blockpool BP-108620518-172.17.0.65-1433327686475 block blk_1073741830

/hdp/apps/2.2.4.2-2/mapreduce/mapreduce.tar.gz: MISSING 1 blocks of total size 58479639 B..
/hdp/apps/2.2.4.2-2/pig/pig.tar.gz: Under replicated BP-108620518-172.17.0.65-1433327686475:blk_1073741990_1168. Target Replicas is 3 but found 1 replica(s).
.
/hdp/apps/2.2.4.2-2/tez/tez.tar.gz: Under replicated BP-108620518-172.17.0.65-1433327686475:blk_1073741960_1138. Target Replicas is 3 but found 1 replica(s).
.
/mr-history/done/2015/06/03/000000/job_1433328045348_0001-1433328283077-ambari%2Dqa-word+count-1433328323621-1-1-SUCCEEDED-default-1433328302419.jhist: CORRUPT blockpool BP-108620518-172.17.0.65-1433327686475 block blk_1073741856

/mr-history/done/2015/06/03/000000/job_1433328045348_0001-1433328283077-ambari%2Dqa-word+count-1433328323621-1-1-SUCCEEDED-default-1433328302419.jhist: MISSING 1 blocks of total size 33669 B..
/mr-history/done/2015/06/03/000000/job_1433328045348_0001_conf.xml: Under replicated BP-108620518-172.17.0.65-1433327686475:blk_1073741857_1033. Target Replicas is 3 but found 1 replica(s).
.
/mr-history/done/2015/06/05/000000/job_1433502507205_0001-1433503933474-ambari%2Dqa-PigLatin%3ApigSmoke.sh-1433503964156-1-0-SUCCEEDED-default-1433503952122.jhist: Under replicated BP-108620518-172.17.0.65-1433327686475:blk_1073741974_1152. Target Replicas is 3 but found 1 replica(s).
.
/mr-history/done/2015/06/05/000000/job_1433502507205_0001_conf.xml: CORRUPT blockpool BP-108620518-172.17.0.65-1433327686475 block blk_1073741975

/mr-history/done/2015/06/05/000000/job_1433502507205_0001_conf.xml: MISSING 1 blocks of total size 227572 B..
/tmp/id11ac4100_date410315: CORRUPT blockpool BP-108620518-172.17.0.65-1433327686475 block blk_1073741840

/tmp/id11ac4100_date410315: MISSING 1 blocks of total size 1393 B..
/user/ambari-qa/mapredsmokeinput: Under replicated BP-108620518-172.17.0.65-1433327686475:blk_1073741847_1023. Target Replicas is 3 but found 1 replica(s).
..
/user/ambari-qa/mapredsmokeoutput/part-r-00000: CORRUPT blockpool BP-108620518-172.17.0.65-1433327686475 block blk_1073741854

/user/ambari-qa/mapredsmokeoutput/part-r-00000: MISSING 1 blocks of total size 1475 B..
/user/ambari-qa/passwd: CORRUPT blockpool BP-108620518-172.17.0.65-1433327686475 block blk_1073741977

/user/ambari-qa/passwd: MISSING 1 blocks of total size 1521 B...
/user/ambari-qa/pigsmoke.out/part-v000-o000-r-00000: CORRUPT blockpool BP-108620518-172.17.0.65-1433327686475 block blk_1073741987

/user/ambari-qa/pigsmoke.out/part-v000-o000-r-00000: MISSING 1 blocks of total size 207 B.
Status: CORRUPT
Total size: 414441608 B
Total dirs: 8591
Total files: 35
Total symlinks: 0
Total blocks (validated): 31 (avg. block size 13369084 B)


CORRUPT FILES: 16
MISSING BLOCKS: 16
MISSING SIZE: 141860175 B
CORRUPT BLOCKS: 16


Minimally replicated blocks: 15 (48.387096 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 15 (48.387096 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 0.48387095
Corrupt blocks: 16
Missing replicas: 30 (32.258064 %)
Number of data-nodes: 1
Number of racks: 1
FSCK ended at Mon Jun 08 11:33:06 EDT 2015 in 605 milliseconds

The filesystem under path '/' is CORRUPT
-bash-4.1#
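
To pull just the affected paths out of a report like this one, a small filter can help. The sketch below assumes the fsck output has been saved to a file (the name `fsck.out` is hypothetical) and relies only on the `: CORRUPT blockpool` marker visible in the lines above.

```shell
#!/bin/sh
# Reads fsck output on stdin and prints each path that has a CORRUPT block,
# one per line, without duplicates.
corrupt_paths() {
  # Keep only lines containing ": CORRUPT blockpool " and strip everything
  # from that marker onward, leaving the HDFS path.
  sed -n 's/: CORRUPT blockpool .*$//p' | sort -u
}

# Example usage (fsck.out is a hypothetical file holding the saved report):
#   corrupt_paths < fsck.out
#   corrupt_paths < fsck.out | wc -l
```

For the report above, the line count should agree with the `CORRUPT FILES` figure in the summary. `hdfs fsck / -list-corruptfileblocks` is a built-in alternative that asks the NameNode for the list directly.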

oguennec (Author) commented

I have solved this issue by adding the --volumes-from initial_container option when running the second container.

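I took a closer look at the Dockerfile of the sequenceiq/ambari Docker image and found that it contains a VOLUME /var/log instruction. When the cluster was created, HDP wrote a large number of files to this location. Since docker commit does not include data stored in volumes, those files were missing from the second container.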
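
A minimal sketch of the workaround, under the same caveats as any illustration: `hdp-snapshot` is a placeholder image name, and the `DOCKER` variable is an assumption added here so the commands are printed rather than executed unless it is set to `docker`.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DOCKER=docker to run them for real.
DOCKER="${DOCKER:-echo docker}"

OLD_CONTAINER="${OLD_CONTAINER:-initial_container}"
SNAPSHOT_IMAGE="${SNAPSHOT_IMAGE:-hdp-snapshot}"   # placeholder name

list_declared_volumes() {
  # Show the VOLUME paths recorded in the image configuration
  # (for this image that should include /var/log).
  $DOCKER inspect --format '{{ json .Config.Volumes }}' "$SNAPSHOT_IMAGE"
}

run_with_original_volumes() {
  # Mount the volumes of the original container into the new one so the
  # data that docker commit left out is visible again.
  $DOCKER run --volumes-from "$OLD_CONTAINER" "$SNAPSHOT_IMAGE"
}

list_declared_volumes
run_with_original_volumes
```

Note that `--volumes-from` only works while the original container still exists, so it must not be removed before the new one is started.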
